ABSTRACT

We show that for every conjunctive query, the complexity of 
evaluating it on a probabilistic database is either PTIME or 
#P-complete, and we give an algorithm for deciding whether 
a given conjunctive query is PTIME or #P-complete. The 
dichotomy property is a fundamental result on query eval- 
uation on probabilistic databases and it gives a complete 
classification of the complexity of conjunctive queries. 

1. PROBLEM STATEMENT 

Fix a relational vocabulary R_1, ..., R_k, denoted \mathcal{R}. A
tuple-independent probabilistic structure is a pair (A, p) where
A = (A, R_1^A, ..., R_k^A) is a first-order structure and p is a
function that associates to each tuple t in A a rational number
p(t) ∈ [0, 1]. A probabilistic structure (A, p) induces a
probability distribution on the set of substructures B of A by:

    P(B) = \prod_{i=1}^{k} \Big( \prod_{t \in R_i^B} p(t) \cdot \prod_{t \in R_i^A - R_i^B} (1 - p(t)) \Big)    (1)

where B ⊆ A, more precisely B = (A, R_1^B, ..., R_k^B) is s.t.
R_i^B ⊆ R_i^A for i = 1, k.

A conjunctive query, q, is a sentence of the form ∃x.(φ_1 ∧
... ∧ φ_m), where each φ_i is a positive atomic predicate R(t),
called a sub-goal, and the tuple t consists of variables and/or
constants. As usual, we drop the existential quantifiers and
the ∧, writing q = φ_1, φ_2, ..., φ_m. A conjunctive property
is a property on structures defined by a conjunctive query 
q, and its probability on a probabilistic structure (A,p) is 
defined as: 

    p(q) = \sum_{B \subseteq A : B \models q} P(B)    (2)

In this paper we study the data complexity of Boolean con- 
junctive properties on tuple independent probabilistic struc- 
tures. (When clear from the context we blur the distinction 
between queries and properties). 
More precisely, for a fixed vocabulary and a Boolean conjunctive
query q we study the following problem:

Evaluation For a given probabilistic structure (A, p), com- 
pute the probability p(q) . 

The complexity is in the size of A and in the size of the 
representations of the rational numbers p(t). This problem 
is trivially contained in #P, and we show conditions under 
which it is in PTIME, and conditions where it is #P-hard. 
The class #P [11] is the counting analogue of the class NP.
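To make Equations (1) and (2) concrete, the following is a minimal brute-force sketch that enumerates every substructure B and sums P(B) over those satisfying the query. The tuple encoding, the toy probabilities, and the helper names (`world_probability`, `brute_force_probability`, `q_hier`) are illustrative assumptions, not part of the paper. The running time is exponential in the number of tuples, so this only serves as a reference point for the PTIME cases discussed later.

```python
from itertools import product


def world_probability(world, probs):
    """Equation (1): multiply p(t) for present tuples and 1 - p(t) for absent ones."""
    result = 1.0
    for t, p in probs.items():
        result *= p if t in world else 1.0 - p
    return result


def brute_force_probability(probs, query):
    """Equation (2): sum P(B) over all substructures B on which the query holds."""
    tuples = list(probs)
    total = 0.0
    for mask in product([False, True], repeat=len(tuples)):
        world = {t for t, keep in zip(tuples, mask) if keep}
        if query(world):
            total += world_probability(world, probs)
    return total


# Hypothetical toy instance: each tuple is a (relation name, arguments) pair.
probs = {
    ("R", ("a",)): 0.5,
    ("S", ("a", "b")): 0.4,
    ("S", ("a", "c")): 0.5,
}


def q_hier(world):
    """Boolean query R(x), S(x, y): some x has both an R tuple and an S tuple."""
    return any(
        rel == "R"
        and any(rel2 == "S" and args2[0] == args[0] for rel2, args2 in world)
        for rel, args in world
    )


print(brute_force_probability(probs, q_hier))
```

On this instance the answer can be checked by hand: R(a) must be present (probability 0.5) and at least one of the two S tuples must be present (probability 1 - 0.6 * 0.5 = 0.7).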

Theorem 1.1. (Dichotomy Theorem) Given any conjunc- 
tive query q, the complexity of Evaluation is either PTIME 
or #P-complete.

Background and motivation Dichotomy theorems are 
fundamental to our understanding of the structure of con- 
junctive queries. A widely studied problem, which can be 
viewed as the dual of our problem, is the constraint satisfac- 
tion problem (CSP) and is as follows: given a fixed relational 
structure, what is the complexity of evaluating conjunctive 
queries over the structure? Schaefer [10] has shown that
over binary domains, CSP has a dichotomy into PTIME and
NP-complete. Feder and Vardi [5] have conjectured that a
similar dichotomy holds for arbitrary (non-binary) domains. 
Creignou and Hermann [3] showed that the counting ver- 
sion of the CSP problem has a dichotomy into PTIME and 
#P-complete. The problem we study in this paper seems 
different in nature, yet still interesting. 

In addition to the pure theoretical interest we also have 
a practical motivation. Probabilistic databases are increas- 
ingly used to manage a wide range of imprecise data [121 12]. 
But general purpose probabilistic databases are difficult to
build, because query evaluation is difficult: it is both theoretically
hard (#P-hard [7, 4]) and plain difficult to understand.
All systems reported in the literature have circumvented
the full query evaluation problem by either severely
restricting the queries fjQ, or by using a non-scalable (ex- 
ponential) evaluation algorithm [6], or by using a weaker 
semantics based on intervals [8]. In our own system, Mys- 
tiQ [2], we support arbitrary conjunctive queries as follows. 
For queries without self-joins, we test if they have a PTIME 
plan using the techniques in [9]; if not, then we run a Monte 
Carlo simulation algorithm. The query execution times between
the two cases differ by one or two orders of magnitude
(seconds vs. minutes). The desire to improve MystiQ's
query performance on arbitrary queries (i.e. with self-joins) 
has partially motivated this work. 
1.1 Overview of Results

We summarize here our main results on the query eval- 
uation problem. Some of this discussion is informal and is 
intended to introduce the major concepts needed to under- 
stand the evaluation of conjunctive queries on probabilistic 
structures. 

Hierarchical queries: For a conjunctive query q, let 
Vars(q) denote its set of variables, and, for x ∈ Vars(q),
let sg(x) be the set of sub-goals that contain x. 

Definition 1.2. A conjunctive query is hierarchical if for
any two variables x, y, either sg(x) ∩ sg(y) = ∅, or sg(x) ⊆
sg(y), or sg(y) ⊆ sg(x). We write x ⊑ y whenever sg(x) ⊆
sg(y) and write x ≡ y when sg(x) = sg(y). A conjunctive
property is hierarchical if it is defined by some hierarchical
conjunctive query.

It is easy to check that a conjunctive property is hierarchical 
if the minimal conjunctive query defining it is hierarchical. 
As an example, the query q_hier = R(x), S(x,y) is hierarchical
because sg(x) = {R, S}, sg(y) = {S}. On the other
hand, the query q_non-h = R(x), S(x,y), T(y) is not hierarchical
because sg(x) = {R, S} and sg(y) = {S, T}.
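The test in Definition 1.2 can be sketched directly in code. The sketch below assumes a query is given as a list of (relation name, arguments) pairs in which every argument is a variable (constants are not modeled), and it identifies sub-goals by their position so that self-joins are handled; the function names are illustrative.

```python
from itertools import combinations


def subgoal_sets(query):
    """Map each variable x to sg(x), the set of sub-goal positions containing x."""
    sg = {}
    for i, (_, args) in enumerate(query):
        for v in args:
            sg.setdefault(v, set()).add(i)
    return sg


def is_hierarchical(query):
    """Check that sg(x), sg(y) are disjoint or nested for every pair of variables."""
    sg = subgoal_sets(query)
    for x, y in combinations(sg, 2):
        a, b = sg[x], sg[y]
        if a & b and not (a <= b or b <= a):
            return False
    return True


q_hier = [("R", ("x",)), ("S", ("x", "y"))]
q_non_h = [("R", ("x",)), ("S", ("x", "y")), ("T", ("y",))]

print(is_hierarchical(q_hier))   # sg(y) is contained in sg(x)
print(is_hierarchical(q_non_h))  # sg(x) and sg(y) overlap without nesting
```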

In prior work [4] we have studied the evaluation problem
under the following restriction: every sub-goal of q refers to 
a different relation name. We say that q has no self-joins. 
The main result in [4], restated in the terminology used here, 
is: 

Theorem 1.3. Assume q has no self-joins. Then: (1)
If q is hierarchical, then it is in PTIME. (2) If q is not 
hierarchical then it is #P-hard. 

Moreover, the PTIME algorithm for a hierarchical query is
the following simple recurrence on the query's structure. Call a
variable x maximal if for all y, y ⊒ x implies x ⊒ y. Pick
a maximal variable from each connected component of the
query to obtain the set x_1, ..., x_m. Let f_0, f_1(x_1), ..., f_m(x_m)
be the connected components of q: f_0 contains all constant
sub-goals, and f_i(x_i) consists of all sub-goals containing x_i
for i = 1, m. Then:

    p(q) = p(f_0) \cdot \prod_{i=1,m} \Big( 1 - \prod_{a \in A} (1 - p(f_i[a/x_i])) \Big)    (3)

This formula is a recurrence on the query's structure (since
each f_i[a/x_i] is simpler than q) and it is correct because
f_i[a/x_i] is independent from f_j[a'/x_j] whenever i ≠ j or
a ≠ a'. As an example, for query q_hier = R(x), S(x,y),

    p(q) = 1 - \prod_{a \in A} \Big( 1 - p(R(a)) \Big( 1 - \prod_{b \in A} (1 - p(S(a,b))) \Big) \Big).
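As a sanity check on the closed form for q_hier = R(x), S(x,y), the sketch below evaluates it in time polynomial in the domain size and compares it against exhaustive enumeration of possible worlds on a small instance. The dictionaries `p_R` and `p_S`, the toy probabilities, and both function names are assumptions made for illustration; tuples missing from the dictionaries are taken to have probability 0.

```python
from itertools import product


def p_hier_closed_form(domain, p_R, p_S):
    """p(q) = 1 - prod_a (1 - p(R(a)) * (1 - prod_b (1 - p(S(a, b)))))."""
    miss_all = 1.0
    for a in domain:
        no_s = 1.0  # probability that no tuple S(a, b) is present
        for b in domain:
            no_s *= 1.0 - p_S.get((a, b), 0.0)
        miss_all *= 1.0 - p_R.get(a, 0.0) * (1.0 - no_s)
    return 1.0 - miss_all


def p_hier_brute_force(domain, p_R, p_S):
    """Sum the probabilities of all worlds in which R(a), S(a, b) holds for some a, b."""
    tuples = [("R", a) for a in p_R] + [("S", ab) for ab in p_S]
    probs = [p_R[a] for a in p_R] + [p_S[ab] for ab in p_S]
    total = 0.0
    for mask in product([False, True], repeat=len(tuples)):
        weight = 1.0
        for keep, p in zip(mask, probs):
            weight *= p if keep else 1.0 - p
        world = {t for t, keep in zip(tuples, mask) if keep}
        if any(("R", a) in world and ("S", (a, b)) in world
               for a in domain for b in domain):
            total += weight
    return total


# Hypothetical toy instance over a two-element domain.
domain = ["a", "b"]
p_R = {"a": 0.5, "b": 0.8}
p_S = {("a", "a"): 0.3, ("a", "b"): 0.6, ("b", "a"): 0.9}

print(p_hier_closed_form(domain, p_R, p_S))
print(p_hier_brute_force(domain, p_R, p_S))
```

The two values agree because distinct choices of a touch disjoint sets of tuples, which is exactly the independence the recurrence relies on.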

In this paper we study arbitrary conjunctive queries (i.e. 
allowing self-joins), which turn out to be significantly more 
complex. The starting point is the following extension of 
Theorem 1.3 (2) (the proof is in the appendix):

Theorem 1.4. If q is not hierarchical then it is #P-hard.

Thus, from now on we consider only hierarchical conjunctive 
queries in this paper, unless otherwise stated. 

Inversions: As a first contact with the issues raised by 
self-joins, let us consider the following query: 

q = R(x), S(x,y), S(x',y'), T(x')

We write it as q = f_1(x) f_2(x'), where f_1(x) = R(x), S(x,y)
and f_2(x') = S(x',y'), T(x'). The query is hierarchical, but
it has a self-join because the symbol S occurs twice: as a
consequence f_1[a/x] is no longer independent from f_2[a/x']
(they share common tuples of the form S(a,b)), which prevents
us from applying Equation (3) directly. Our approach
here is to define a new query by equating x = x', f_3(x) =
f_1(x) f_2(x) = R(x), S(x,y), S(x,y'), T(x), which is equivalent
to R(x), S(x,y), T(x). We show that the probability
p(q) can be expressed using recurrences over the probabilities
of queries of the form f_1[a/x_1], f_2[a'/x_2], f_3[a''/x_3], as
a sum of a few formulas (see footnote 1) in the same style as (3)
(see Example 2.14). The correctness is based on the fact that
f_i[a/x_i] and f_j[a'/x_j] are independent if i ≠ j or a ≠ a'.

However, this approach fails when the query has an "in- 
version". Consider: 

H_0 = R(x), S(x,y), S(x',y'), T(y')

This query is hierarchical, but the above approach no longer
works. The reason is that the two sub-goals S(x,y) and
S(x',y') unify, while x ⊐ y and x' ⊏ y': we call this an
inversion (formal definition is in Sec. 2.2). If we write H_0 as
f_1(x) f_2(y') and attempt to apply a recurrence formula, the
queries f_1[a/x] and f_2[a'/y'] are no longer independent even
if a ≠ a', because they share the common tuple S(a, a').
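The loss of independence caused by the shared tuple S(a, a') can be observed numerically. The sketch below uses a hypothetical three-tuple instance with a = 1 and a' = 2, all probabilities set to 0.5 for illustration, and compares the joint probability of f_1[a/x] and f_2[a'/y'] with the product of their marginals. The helper names are assumptions.

```python
from itertools import product

# Hypothetical instance: a = 1, a' = 2, and S(1, 2) is the shared tuple.
probs = {("R", (1,)): 0.5, ("S", (1, 2)): 0.5, ("T", (2,)): 0.5}


def probability(event):
    """Sum the probabilities of all possible worlds in which the event holds."""
    tuples = list(probs)
    total = 0.0
    for mask in product([False, True], repeat=len(tuples)):
        world = {t for t, keep in zip(tuples, mask) if keep}
        weight = 1.0
        for t in tuples:
            weight *= probs[t] if t in world else 1.0 - probs[t]
        if event(world):
            total += weight
    return total


def f1(world, a=1):
    """f_1[a/x] = R(a), S(a, y)."""
    return ("R", (a,)) in world and any(
        rel == "S" and args[0] == a for rel, args in world)


def f2(world, a2=2):
    """f_2[a'/y'] = S(x', a'), T(a')."""
    return ("T", (a2,)) in world and any(
        rel == "S" and args[1] == a2 for rel, args in world)


p1 = probability(f1)
p2 = probability(f2)
p_joint = probability(lambda w: f1(w) and f2(w))
print(p1 * p2, p_joint)
```

Here the joint probability (all three tuples present) is twice the product of the marginals, so a product formula in the style of (3) would give the wrong answer.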
Inversions can occur as a result of a chain of unifications: 

    H_k = R(x), S_0(x,y),
          S_0(u_1,v_1), S_1(u_1,v_1)
          S_1(u_2,v_2), ...
          S_{k-1}(u_k,v_k), S_k(u_k,v_k)
          S_k(x',y'), T(y')

Here any two consecutive pairs of variables in the sequence
x ⊐ y, u_1 ≡ v_1, u_2 ≡ v_2, ..., x' ⊏ y' unify, and we also call
this an inversion. We prove in the Appendix:

Theorem 1.5. For every k ≥ 0, H_k is #P-hard.

Thus, some hierarchical queries with inversions are #P-hard. 
We prove, however, that if q has no inversions, then it is in 
PTIME: 

Theorem 1.6. If q is hierarchical and has no inversions, 
then it is in PTIME. 

The PTIME algorithm for inversion-free queries is a sum
of recurrence formulas, each similar in spirit to (3). The
proof is in Sec. 3.2.

Erasers: The precise boundary between PTIME and #P-hard
queries is more subtle than simply testing for inversions:
some queries with inversions are #P-hard, while others
are in PTIME, as illustrated below:

Example 1.7 Consider the hierarchical query q

    q = R(r,x), S(r,x,y), U(a,r), U(r,z), V(r,z)
        S(r',x',y'), T(r',y'), V(a,r')
        R(a,b), S(a,b,c), U(a,a)

1 This particular example admits an alternative, perhaps 
simpler PTIME solution, based on a dynamic programming 
algorithm on the domain A. For other, very simple queries, 
we are not aware of any algorithm that is simpler than ours
(formula (9), Sec. 3.2), for example R(x,y,y,x), R(x,y,x,z),
or R(y,x,y,x,y), R(y,x,y,z,x), R(x,x,y,z,u) (both are in
PTIME because they have no inversions). To appreciate
the difficulties even with such simple queries note that, by 
contrast, R(y, x, y, x, y),R(y, y, y, z, x),R(x, x, y, z, u) is #P- 
hard. For additional challenging PTIME queries, see Fig.Q] 
Here a, b, c are constants and the rest are variables. This
query has an inversion between x ⊐ y and x' ⊏ y' (when
unifying S(r,x,y) with S(r',x',y')). Because of this inversion,
one may be tempted to try to prove that it is #P-hard,
using a reduction from H_0. Our standard construction starts
by equating r = r' to make q "like" H_0: call q' the resulting
query (i.e. q' = q[r/r']). If one works out the details of the
reduction, one gets stuck by the existence of the following
homomorphism h : q → q' that "avoids the inversion":
it maps the variables r, x, y, z, r', x', y' to a, b, c, r, r, x', y'
respectively, in particular sending U(r,z), V(r,z) to U(a,r),
V(a,r). Thus, h takes advantage of the two sub-goals U(a,r),
V(a,r) in q' which did not exist in q, and its image does not
contain the sub-goal S(r,x,y), which is part of the inversion.
We call such a homomorphism an eraser for this inversion:
the formal definition is in Sec. 2.3. Because of this eraser,
we cannot use the inversion to prove that the query is #P-hard.
So far this discussion suggests that erasers are just
a technical annoyance that prevents us from proving hardness
of some queries with inversions. But, quite remarkably,
erasers can also be used in the opposite direction, to derive
a PTIME algorithm: they are used to cancel out (hence
"erase") the terms in a certain expansion of p(q) that correspond
to inversions and that do not have polynomial size
closed forms. Thus, our final result (proven in Sections 3
and 4) is:

Theorem 1.8 (Dichotomy). Let q be hierarchical. 

(1) If q has an inversion without erasers then q is #P-hard. 

(2) If all inversions of q have erasers then q is in PTIME. 

As a non-trivial application of (1) we show (Fig. 2 in Appendix
[X] and in Example 4.1) that each of the following
two queries is #P-hard, since each has an inversion between
two isomorphic copies of itself:

q_2path = R(x,y), R(y,z)

q_marked-ring = R(x), S(x,y), S(y,x)

In general, the hardness proof is by reduction from the query 
Hk, where k is the length of an inversion without an eraser. 
The proof is not straightforward. It turns out that not every 
eraser-free inversion can be used to show hardness. Instead 
we show that if there is an eraser-free inversion then there 
is one that admits a reduction from H_k.

The PTIME algorithm in (2) is also not straightforward at 
all. It is quite different from the recurrence formula in Theorem
1.6 since we can no longer iterate on the structure of the
query: in Example 1.7 the sub-query of q consisting of the
first two lines is #P-hard (since without the third line there 
is no eraser), hence we cannot compute it separately from 
the third line. Our algorithm here computes p(q) without 
recurrence, and thus is quite different from the inversion-free 
PTIME algorithm, but uses the latter as a subroutine. 

2. AN EXPANSION FORMULA FOR CON- 
JUNCTIVE QUERIES 

In this section, we introduce the key terminology and
prove an expansion formula for computing the probability of
conjunctive queries that will be used to devise PTIME algorithms
for query evaluation. For the remainder of the paper,
all queries are assumed to be hierarchical, as we know that
non-hierarchical queries are #P-hard (Appendix B).



2.1 Coverage 

We call an arithmetic predicate a predicate of the form
u = v, u ≠ v, or u < v between a variable and a constant
in C, or between two variables (see footnote 2). A restricted
arithmetic predicate is an arithmetic predicate that is either
between a variable and a constant, or between two variables u, v
that co-occur in some sub-goal (equivalently u ⊒ v or u ⊑ v).
From now on, we will allow all conjunctive queries to have
restricted arithmetic predicates.

Definition 2.1. A coverage for a query q is a set of conjunctive
queries C = {qc_1, ..., qc_n} such that:

    q ≡ qc_1 ∨ ... ∨ qc_n

Each query in C is called a cover. A factor of C is a connected
component of some qc_i ∈ C. We denote the set of all
factors in C by F = {f_1, ..., f_k}.

We alternatively represent a coverage by the pair (F, C),
where F is a set of factors and C is a set of subsets of
F. Each element of C determines a cover consisting of the
corresponding set of factors from F.

For any query q the set C = {q} is a trivial coverage. We
also define C^<(q), which we call the canonical coverage,
obtained as follows. Consider all m pairs (u, v) of co-occurring
variables u, v in q, or of a variable u and constant v. For
each such pair choose one of the following predicates: u < v
or u = v or u > v, and add it to q. This results in 3^m
queries. Remove the unsatisfiable ones, then remove all redundant
ones (i.e. remove qc_i if there exists another qc_j s.t.
qc_i ⊆ qc_j). The resulting set C^<(q) = {qc_1, ..., qc_n} is the
canonical coverage of q.

Unifiers 

Let q, q' be two queries (not necessarily distinct). We rename
their variables to ensure that Vars(q) ∩ Vars(q') = ∅,
and write qq' for their conjunction. Let g and g' be two
sub-goals in q and q' respectively. The most general unifier,
MGU, of g and g' (or the MGU of q, q' when g, g' are clear
from the context) is a substitution θ for qq' s.t. (a) θ(g) =
θ(g'), (b) for any other substitution θ' s.t. θ'(g) = θ'(g')
there exists ρ s.t. ρ ∘ θ = θ'.

A 1-1 substitution for queries q, q' is a substitution θ for qq'
such that: (a) for any variable x and constant a, θ(x) ≠ a,
and (b) for any two distinct variables x, y in q (or in q'),
θ(x) ≠ θ(y). The set representation of a 1-1 substitution θ
is the set {(x, y) | x ∈ Vars(q), y ∈ Vars(q'), θ(x) = θ(y)}.

Definition 2.2. An MGU θ for two queries q, q' is called
strict if it is a 1-1 substitution for qq'.

For a trivial illustration, if q = R(x, x, y, a, z) and q' =
R(u, v, v, w, w) and their MGU is θ, then θ(x) = θ(y) =
θ(u) = θ(v) = x', θ(w) = θ(z) = a, and the effect of the
unification is θ(qq') = R(x', x', x', a, a). This is not strict:
e.g. θ(x) = θ(y) and also θ(z) = a. We want to ensure that
all unifications are strict.
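The illustration above can be reproduced with a small union-find unifier. This is a minimal sketch under stated assumptions: sub-goals are (relation name, arguments) pairs, the two sub-goals already use disjoint variable names, and the set of constants is passed explicitly. The names `unify` and `is_strict` are illustrative and do not come from the paper.

```python
def unify(goal1, goal2, constants):
    """Return a dict mapping each term to its class representative, or None."""
    parent = {}

    def find(t):
        parent.setdefault(t, t)
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    def union(s, t):
        rs, rt = find(s), find(t)
        if rs == rt:
            return True
        if rs in constants and rt in constants:
            return False  # two distinct constants cannot be unified
        if rs in constants:
            rs, rt = rt, rs
        parent[rs] = rt  # a constant, if present, stays the representative
        return True

    (rel1, args1), (rel2, args2) = goal1, goal2
    if rel1 != rel2 or len(args1) != len(args2):
        return None
    for s, t in zip(args1, args2):
        if not union(s, t):
            return None
    return {t: find(t) for t in parent}


def is_strict(theta, vars1, vars2, constants):
    """Strict: no variable maps to a constant, no two variables of one query merge."""
    if any(theta[v] in constants for v in vars1 | vars2):
        return False
    for group in (vars1, vars2):
        reps = [theta[v] for v in group]
        if len(reps) != len(set(reps)):
            return False
    return True


g1 = ("R", ("x", "x", "y", "a", "z"))
g2 = ("R", ("u", "v", "v", "w", "w"))
theta = unify(g1, g2, constants={"a"})
print(theta)
print(is_strict(theta, {"x", "y", "z"}, {"u", "v", "w"}, {"a"}))
```

The unifier merges x, y, u, v into one class and sends z and w to the constant a, so it fails both conditions of a 1-1 substitution.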

Definition 2.3. (Strict coverage) Let C be a coverage and
F be its factors. We say that C is strict if any MGU between
any two factors f, f' ∈ F is strict.

2 As usual we require every variable to be range restricted, 
i.e. to occur in at least one sub-goal. 
Example 2.4 Let q = T(x), R(x, x, y), R(u, v, v). The trivial
coverage C = {q} is not strict, as the MGU of the two R
sub-goals of q equates x with y and u with v. Alternatively,
consider the following three queries:

    qc_1 = T(x), R(x,x,x)

    qc_2 = T(x), R(x,x,y), R(u,u,u), x ≠ y

    qc_3 = T(x), R(x,x,y), R(u,v,v), x ≠ y, u ≠ v

One can show that q ≡ qc_1 ∨ qc_2 ∨ qc_3, hence C = {qc_1, qc_2, qc_3}
is a coverage for q. The set of factors F consists of the
connected components of these queries, which are

    f_1 = T(x), R(x,x,x)        f_2 = T(x), R(x,x,y), x ≠ y
    f_3 = R(u,u,u)              f_4 = R(u,v,v), u ≠ v

and C = {{f_1}, {f_2, f_3}, {f_2, f_4}}. The coverage is strict, as
a unifier cannot equate x with y or u with v in any query
because of the inequalities. Similarly, the canonical coverage
C^<(q), which has nine covers containing combinations of x <
y, x = y, or x > y with u < v, u = v, u > v, is also strict.

Lemma 2.5. The canonical coverage C^<(q) is always strict.

2.2 Inversions 

Fix a strict coverage C for q, with factors F, and define the
following undirected graph G. Its nodes are triples (f, x, y)
with f ∈ F and x, y ∈ Vars(f), and its edges are pairs
((f, x, y), (f', x', y')) s.t. there exist two sub-goals g, g' in
f, f' respectively whose MGU θ satisfies θ(x) = θ(x') and
θ(y) = θ(y'). We call an edge in G a unification edge, and a
path a unification path. Recall that for a preorder relation
⊒, the notation x ⊐ y means x ⊒ y and x ⋢ y.

Definition 2.6. (Inversion-free Coverage) An inversion
in C is a unification path from a node (f, x, y) with x ⊐ y to
a node (f', x', y') with x' ⊏ y'. An inversion-free coverage
is a strict coverage that does not have an inversion. We say
that q is inversion-free if it has at least one inversion-free
coverage. Otherwise, we say that q has an inversion.

Obviously, to check whether C has an inversion it suffices 
to look for a path in which all intermediate nodes are of the 
form (f'', u, v) with u ≡ v, i.e. the ⊐ and ⊏ are only at
the two ends of the path. The following result says that to 
check if a query has an inversion, it is enough to examine 
the canonical coverage. 

Proposition 2.7. If there exists one coverage of q that
does not contain an inversion, then the canonical coverage C^<(q)
does not contain an inversion.

Example 2.8 We illustrate with two examples: 

(a) Consider H_k in Theorem 1.5. The trivial coverage
C = {H_k} is strict, and has factors F = {f_0, f_1, ..., f_{k+1}}
(each line in the definition of H_k is one factor). The following
is an inversion: (f_0, x, y), (f_1, u_1, v_1), ..., (f_k, u_k, v_k),
(f_{k+1}, x', y'). This is an inversion because x ⊐ y and x' ⊏ y'.
The canonical coverage C^< also has an inversion, e.g.
along the factors obtained by adding the predicates x < y,
u_1 < v_1, ..., u_k < v_k, x' < y'.

(b) Consider the query q = R(x), S(x,y), S(y,x). The
trivial coverage C = {q} is strict, has one factor F = {q},
and there is an inversion from (q, x, y) to (q, y, x) because
S(x,y) unifies with S(y,x) (recall that we rename the variables
before the unification, i.e. the unifier is between R(x),
S(x,y), S(y,x) and its copy R(x'), S(x',y'), S(y',x')). In
the canonical coverage C^< there are three factors, corresponding
to x < y, x = y, and y < x, and the inversion
is between x < y and y < x.

2.3 An Expansion Formula for Coverage 

Given a conjunctive query q and a probabilistic structure
A = (A, R_1^A, ..., R_k^A), we want to compute the probability
p(q). Our main tool is a generalized inclusion-exclusion
formula that we apply to the coverage of a query. 

Definition 2.9. (Expansion Variables) Let C = (F, C)
be a strict coverage, where F = {f_1, ..., f_k} is a set of factors
and C is a set of subsets of F. A set of expansion
variables is a set x = {x_{f_1}, ..., x_{f_k}} such that

1. x_{f_i} ⊆ Vars(f_i) for 1 ≤ i ≤ k.

2. If x ∈ x_f and x ⊑ y, then y ∈ x_f.

3. Any MGU of any two factors f_i and f_j equates an
expansion variable to an expansion variable.

We use (F, C, x) to denote a coverage where we have chosen
the expansion variables.

Definition 2.10. (Unary coverage) A coverage (F, C, x)
is called a unary coverage if for each f ∈ F, x_f consists of
a single variable r_f. We call r_f the root variable in f.

By definition of expansion variables, the root variable must
be the maximal element under the ⊑ order, i.e. must occur in
all the sub-goals of the corresponding factor.

Our first PTIME algorithm (for inversion-free queries) uses 
a unary coverage: the discussion in the next few subsec- 
tions is much easier to follow if one assumes all coverages 
to be unary. Our second PTIME algorithm (for queries with 
erasable inversions) uses a coverage in which all variables 
are expansion variables, i.e. x_f = Vars(f): for that reason
our discussion below needs to be more complex. 

For f ∈ F, let A_f = A^{|x_f|}, and for a ∈ A_f, let f(a) denote
the query f[a/x_f], i.e., the conjunctive query obtained by
substituting the variables x_f with a. The following follows
simply from the definitions:

    q \equiv \bigvee_{c \in C} \bigwedge_{f \in c} \bigvee_{a \in A_f} f(a)    (4)

Our next step is to apply the inclusion/exclusion formula
to (4). We need some notations. We call a subset σ ⊆ F a
signature. Given s ⊆ C, its signature is sig(s) = ∪_{c ∈ s} c.

Definition 2.11. Given a set σ ⊆ F, define

    N(C, \sigma) = (-1)^{|\sigma|} \sum_{s \subseteq C : sig(s) = \sigma} (-1)^{|s|}

For example, if C = {c_1, c_2, c_3} where c_1 = {f_1, f_2}, c_2 =
{f_2, f_3} and c_3 = {f_1, f_3}, then for signature σ = {f_1, f_2, f_3}
we have

    N(\sigma) = (-1)^{|\{f_1,f_2,f_3\}|} \big( (-1)^{|\{c_1,c_2\}|} + (-1)^{|\{c_1,c_3\}|} + (-1)^{|\{c_2,c_3\}|} + (-1)^{|\{c_1,c_2,c_3\}|} \big) = -2
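The worked example can be checked mechanically. The sketch below assumes the reading N(C, σ) = (-1)^{|σ|} Σ_{s ⊆ C : sig(s) = σ} (-1)^{|s|}, which is the form suggested by the sign factor and the four terms in the example above; the function name `N` mirrors the text, while the string labels for factors are illustrative.

```python
from itertools import combinations


def N(covers, sigma):
    """Sum (-1)^|s| over subsets s of the covers whose signature (union) is sigma,
    then multiply by (-1)^|sigma| (assumed form of Definition 2.11)."""
    sigma = frozenset(sigma)
    total = 0
    for r in range(len(covers) + 1):
        for s in combinations(covers, r):
            if frozenset().union(*s) == sigma:
                total += (-1) ** r
    return (-1) ** len(sigma) * total


c1 = frozenset({"f1", "f2"})
c2 = frozenset({"f2", "f3"})
c3 = frozenset({"f1", "f3"})

print(N([c1, c2, c3], {"f1", "f2", "f3"}))
```

Three two-element subsets and the single three-element subset have signature {f1, f2, f3}, contributing 1 + 1 + 1 - 1 = 2 before the sign factor (-1)^3 is applied.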

Given k sets T = {T_{f_1}, ..., T_{f_k}}, where T_{f_i} ⊆ A_{f_i}, we
denote its signature sig(T) = {f | T_f ≠ ∅}, its cardinality
|T| = Σ_i |T_{f_i}|, and denote F(T) the query ∧_{f ∈ F} ∧_{a ∈ T_f} f(a).
Definition 2.12. (Expansion) Given a coverage C, define
its expansion as

    Exp(C) = \sum_{T} N(C, sig(T)) (-1)^{|T|} p(F(T))    (5)

We prove the following in the appendix, using the inclusion/exclusion
formula on (4):

Theorem 2.13. (Expansion Theorem) If C is a coverage 
for q, then p(q) = Exp(C). 

Of course, Equation (5) is of exponential size. To reduce
it, our first goal is to express p(F(T)) as the product
∏_f ∏_{a ∈ T_f} p(f(a)). For that we need to ensure that any two
queries f(a), a ∈ A_f and f'(a'), a' ∈ A_{f'} are independent,
and this does not hold in general. We will enforce this by
restricting the sets T in Eq. (5) to satisfy some extra conditions,
which we call independence predicates. We first illustrate
independence predicates on a running example, then
present them in the general case. Then we will move to our
second goal: finding a closed form for the sum of products.

2.4 Running Example 

We give the basic intuition for independence predicates 
using the following example. 

Example 2.14 Consider the following query 

q = P(x),R(x,y),R(x',y'),S(x') 

and a coverage C = (F, C, x) where F consists of the following
three queries:

    f_1 = P(x_1), R(x_1,y_1)
    f_2 = R(x_2,y_2), S(x_2)
    f_3 = P(x_3), R(x_3,y_3), S(x_3)

and C = {{f_1, f_2}, {f_3}} and the expansion variables are
x_{f_1} = {x_1}, x_{f_2} = {x_2}, x_{f_3} = {x_3}. It is easy to verify
that C defined here is indeed a coverage. (Here f_3 is redundant,
i.e. {{f_1, f_2}} is already a coverage. The reason why
we include f_3 will become clear later.) The function N on
signatures is as follows: N(C, {f_1, f_2}) = 1, N(C, {f_3}) =
N(C, {f_1, f_2, f_3}) = -1 and N(C, σ) = 0 for all other σ.
Thus, the inclusion-exclusion formula in Theorem 2.13 gives:

    p(q) = \sum_{T} N(C, sig(T)) (-1)^{|T|} p(F(T))    (6)

where T is a triplet of sets {T_1, T_2, T_3}, |T| = |T_1| + |T_2| + |T_3|
and F(T) = f_1(T_1) f_2(T_2) f_3(T_3). Consider now three sets
T_1, T_2, T_3, and let's examine the query F(T). If T_1 ∩ T_2 =
T_1 ∩ T_3 = T_2 ∩ T_3 = ∅ then f_i(a) is independent from f_j(a'),
for all i ≠ j, or for i = j and a ≠ a'. In this case p(F(T))
is a product ∏_{i=1,3} ∏_{a ∈ T_i} p(f_i(a)). We will ensure that the
sets T_i are disjoint in two steps. First we will show:

    p(q) = \sum_{T | T_1 \cap T_2 = \emptyset} N(C, sig(T)) (-1)^{|T|} p(F(T))    (7)

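Starting from Eq. (6) we note that N(C, sig(T)) is nonzero for
only three signatures, hence p(q) = p_1 + p_2 + p_3, where

    p_1 = \sum_{T_1 \neq \emptyset, T_2 \neq \emptyset, T_3 = \emptyset} (-1)^{|T|} p(F(T))
    p_2 = -\sum_{T_1 = T_2 = \emptyset, T_3 \neq \emptyset} (-1)^{|T|} p(F(T))
    p_3 = -\sum_{T_1 \neq \emptyset, T_2 \neq \emptyset, T_3 \neq \emptyset} (-1)^{|T|} p(F(T))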



Let p_1' and p_3' denote the same sums as p_1 and p_3, but
where T is restricted to satisfy T_1 ∩ T_2 = ∅. To prove Equation
(7), all we need to show is that p_1 + p_3 = p_1' + p_3'. In the
sum defining p_3 denote T_3' = T_3 - T_1 ∩ T_2, T_3'' = T_3 ∩ T_1 ∩ T_2
(hence T_3 = T_3' ∪ T_3'') and T' = (T_1, T_2, T_3'). We have p_3 =

    - \sum_{T' | T_1 \neq \emptyset, T_2 \neq \emptyset, T_3' \cap T_1 \cap T_2 = \emptyset} \; \sum_{T_3'' \subseteq T_1 \cap T_2, T_3' \cup T_3'' \neq \emptyset} (-1)^{|T|} p(F(T))

    = - \sum_{T' | T_1 \neq \emptyset, T_2 \neq \emptyset, T_3' \cap T_1 \cap T_2 = \emptyset} (-1)^{|T'|} p(F(T')) \sum_{T_3'' \subseteq T_1 \cap T_2, T_3' \cup T_3'' \neq \emptyset} (-1)^{|T_3''|}

    = p_3' + 0 + (p_1' - p_1)

The first line simply splits the summation into a sum
where T_1, T_2, T_3' range over subsets of A, and an inner sum
where T_3'' ranges over subsets of T_1 ∩ T_2. The second line
holds because the query F(T) = f_1(T_1) f_2(T_2) f_3(T_3') f_3(T_3'')
is logically equivalent to f_1(T_1) f_2(T_2) f_3(T_3') since ∀a ∈ T_3'',
f_3(a) is f_1(a) f_2(a) and a is in both T_1 and T_2. The last line
follows by breaking the sum into three disjoint sums:

1. T_1 ∩ T_2 = ∅. Then, T_3'' is only allowed to be the empty
set and the inner sum is 1. The total contribution of
such terms is exactly equal to p_3'.

2. T_1 ∩ T_2 ≠ ∅, T_3' ≠ ∅. Then the inner sum, Σ_{T_3''} (-1)^{|T_3''|},
is 0, because T_3'' ranges over all subsets of T_1 ∩ T_2.

3. T_1 ∩ T_2 ≠ ∅, T_3' = ∅. Then the inner sum is -1, because
T_3'' ranges over all subsets of T_1 ∩ T_2 except ∅. The
total contribution is p_1' - p_1.

Thus, we have shown Equation (7). Next, we introduce similar
predicates between T_1, T_3, and T_2, T_3. This turns out to
be much simpler: we write T_1 as T_1' ∪ T_1'' where T_1' = T_1 - T_3
and T_1'' = T_1 ∩ T_3. Similarly, we write T_2 as T_2' ∪ T_2'' with T_2' =
T_2 - T_3 and T_2'' = T_2 ∩ T_3. The query f_1(T_1) f_2(T_2) f_3(T_3)
is logically equivalent to f_1(T_1') f_2(T_2') f_3(T_3) since both f_1
and f_2 have a mapping to f_3. We now have independence
predicates between T_1' and T_3 and T_2' and T_3. We replace T
with T' = (T_1', T_2', T_3, T_1'', T_2''). Denoting ip(T') = (T_1' ∩ T_2' =
T_1'' ∩ T_2'' = T_1' ∩ T_3 = T_2' ∩ T_3 = ∅, T_1'' ⊆ T_3, T_2'' ⊆ T_3), we
have:

    p(q) = \sum_{T' | ip(T')} N(C, sig(T')) (-1)^{|T'|} p(F(T'))

         = \sum_{T' | ip(T')} N(C, sig(T')) (-1)^{|T'|} \prod_{i=1,3} \prod_{a \in T_i'} p(f_i(a))    (8)

Note that the summation is over five sets T_1', T_2', T_3', T_1'', T_2''
but only T_1', T_2', T_3' are used in the computation of p. The
independence predicate ip allowed us to express p(F(T)) as
a product. We will show later how to compute this sum. 
First, we need to show how to derive and use independence 
predicates in general. □ 

2.5 Independence Predicates 

Our goal in this section is to define formally independence
predicates. For unary coverages, an independence
predicate is simply a statement T_i ∩ T_j = ∅, but the non-unary
case requires more formalism. We first introduce a
new relational vocabulary, \mathcal{T}, consisting of the relation symbols
T_{f_1}, ..., T_{f_k} of arities |x_{f_1}|, ..., |x_{f_k}| respectively. A
structure over this vocabulary is a k-tuple of sets T; given
a conjunctive query φ over the vocabulary \mathcal{T}, T ⊨ φ means
that φ is true on T. For a trivial illustration, assume T_{f_1},
T_{f_2} to be of arity 1, and φ = T_{f_1}(x), T_{f_2}(x). Then φ states
that T_{f_1} ∩ T_{f_2} ≠ ∅.

Suppose we have two factors f_i and f_j and θ is any
1-1 substitution on f_i, f_j, given in set representation, such
that for all (x_i, x_j) ∈ θ, x_i is an expansion variable of f_i and
x_j is an expansion variable of f_j. Define

    \theta^R(f_i, f_j) = f_i, f_j, \bigwedge_{(x_i, x_j) \in \theta} x_i = x_j

    \theta^T(f_i, f_j) = T_{f_i}(x_{f_i}), T_{f_j}(x_{f_j}), \bigwedge_{(x_i, x_j) \in \theta} x_i = x_j

Note that θ^R(f_i, f_j) is over the vocabulary \mathcal{R} (same as the
original query q), while θ^T(f_i, f_j) is over the vocabulary \mathcal{T}.
We call them the join query and the join predicate respectively.
We call the negation of the join predicate, not(θ^T(f_i, f_j)),
an independence predicate.

Example 2.15 Consider factors f_1 and f_2 in Example 2.14
and let θ = {(x_1, x_2)}. Then, θ^R(f_1, f_2) = P(x), R(x,y), S(x),
θ^T(f_1, f_2) = T_1(x), T_2(x), and the independence predicate
not(θ^T(f_1, f_2)) says that T_1 and T_2 are disjoint.

The key property of independence predicates is the following:
If T_i, T_j satisfy all independence predicates between
f_i and f_j, then for all a ∈ T_i and a' ∈ T_j, f_i(a) and f_j(a')
are independent.

2.6 Hierarchical Closure 

Recall from Example 2.14 that, in order to introduce an
independence predicate between two sets T_1, T_2 we needed
to use the join query of their factors, f_3(x) = f_1(x), f_2(x).
In general, the join query between two factors in F is not
necessarily in F (f_3 was redundant in Example 2.14). Thus,
we will proceed as follows. Starting from a coverage C we will
add join queries repeatedly until we obtain its hierarchical
closure, denoted C*, then we will introduce independence
predicates. Computing C* is straightforward when C is an
inversion-free coverage (which is the case for our first PTIME
algorithm), but when C has inversions then some join queries
are non-hierarchical and we cannot add them to C*. We
define next C* in the general case. Let C = (F, C, x) be any
coverage with a set of expansion variables x.

Definition 2.16. Given two factors f_1 and f_2, with expansion
variables x_{f_1} and x_{f_2}, and an MGU given by the set
representation θ, the hierarchical unifier θ_u is the maximal
subset of θ such that:

1. (x, y) ∈ θ_u ⇒ x ∈ x_{f_1}, y ∈ x_{f_2}.

2. If (x, y) ∈ θ_u and (x', y') is such that x ⊑ x' or y ⊑ y'
and (x', y') ∈ θ, then (x', y') ∈ θ_u.

3. The query θ_u^R(f_1, f_2) is hierarchical.

It can be shown that θ_u is uniquely determined. If θ_u is
non-empty, we say that f_1 and f_2 can be hierarchically joined
using θ and call the query θ_u^R(f_1, f_2) the hierarchical join of
f_1 and f_2, and θ_u^T(f_1, f_2) the hierarchical join predicate.



Example 2.17 Let

    f_1 = R(r,x), S(r,x,y), U(a,r), U(r,z), V(r,z)
    f_2 = S(r',x',y'), T(r',y'), V(a,r')

and θ = {(r,r'), (x,x'), (y,y')} be the MGU of the two S
sub-goals. Then, the hierarchical unifier is θ_u = {(r,r')}. If
we include any of (x,x') or (y,y'), we will have to include
the other because x ⊐ y and x' ⊏ y', and then the join will
not be hierarchical. The hierarchical join for this unifier is

    θ_u^R(f_1, f_2) = R(r,x), S(r,x,y), U(a,r), U(r,z), V(r,z)
                      S(r,x',y'), T(r,y'), V(a,r)

and the set of expansion variables of the join is {r}. □

Starting from the factors F, we construct a set H, a function
Factors from H to subsets of F, and a set of expansion
variables x_h for h ∈ H. This is done inductively as follows:

1. For each f ∈ F, add f to H and let Factors(f) = {f}.

2. For any two queries h_1, h_2 in H, and any MGU θ
between h_1 and h_2, let h = θ_u^R(h_1, h_2) be their hierarchical
join. Then add h to H, define Factors(h) =
Factors(h_1) ∪ Factors(h_2); define x_h = θ_u(x_{h_1} ∪ x_{h_2}).

We need to show that $\mathcal{H}$ is finite. This follows from:

Lemma 2.18. Given a fixed relational vocabulary $\mathcal{R}$ and a
fixed set of constants $C$, the number of distinct hierarchical
queries over $\mathcal{R}$ and $C$ is finite.

Define $\mathcal{F}^*$ to be the subset of $\mathcal{H}$ containing queries that
are either inversion-free or in $\mathcal{F}$.

Definition 2.19. (Hierarchical Closure) Given a coverage
$\mathcal{C} = (\mathcal{F}, C, \bar{x})$, its hierarchical closure is $\mathcal{C}^* = (\mathcal{F}^*, C^*, \bar{x}^*)$
where $\mathcal{F}^*$, $\bar{x}^*$ are defined above and:

$$C^* = \{c \mid c \subseteq \mathcal{F}^*,\ \bigcup_{f \in c} Factors(f) \in C\}$$

Note that $\mathcal{C}^*$ is indeed a coverage since the set $\mathcal{F}^*$ contains
the set $\mathcal{F}$, the set $C^*$ contains the set $C$, and the expansion
variables satisfy the conditions in Def. 2.9. Let $ip(\mathcal{C}^*)$ be
the conjunction of $not(jp)$, where $jp$ ranges over all possible
hierarchical join predicates in $\mathcal{F}^*$.

Lemma 2.20. If $\bar{T} \models ip(\mathcal{C}^*)$, then

$$p(\mathcal{F}^*(\bar{T})) = \prod_{f \in \mathcal{F}^*} \prod_{\bar{a} \in T_f} p(f(\bar{a}))$$

Finally, we look at conditions under which we can add the
predicate $ip(\mathcal{C}^*)$ over $\bar{T}$. We divide the join predicates into
two disjoint sets, trivial and non-trivial. A join predicate
between factors $h_i$ and $h_j$ is called trivial if the join query
is equivalent to either $h_i$ or $h_j$, and is called non-trivial
otherwise. We write $ip(\mathcal{C}^*)$ as $ip^n(\mathcal{C}^*) \wedge ip^t(\mathcal{C}^*)$, where
$ip^n(\mathcal{C}^*)$ is the conjunction of $not(jp)$ over all non-trivial
join predicates $jp$, and $ip^t(\mathcal{C}^*)$ is the conjunction over all
trivial join predicates.

Definition 2.21. (Eraser) Given a hierarchical join $jq = \theta_u(f_i, f_j)$,
an eraser for $jq$ is a set of factors $E \subseteq \mathcal{F}^*$ s.t.:

1. $\forall q \in E$, there is a homomorphism from $q$ to $jq$.

2. $\forall \sigma \subseteq \mathcal{F}^*$, $N(\mathcal{C}, \sigma \cup \{f_i, f_j\}) = N(\mathcal{C}, \sigma \cup \{f_i, f_j\} \cup E)$.

Theorem 2.22. Let $q$ be a query such that every hierarchical
join query $jq = \theta_u(f_i, f_j)$ between two factors in $\mathcal{F}^*$
has an eraser. Then,

$$p(q) = \sum_{\bar{T} \mid \bar{T} \models ip^n(\mathcal{C}^*)} N(\mathcal{C}^*, sig(\bar{T})) (-1)^{|\bar{T}|} p(\mathcal{F}^*(\bar{T}))$$

The theorem allows us to add all possible non-trivial independence
predicates over the summation. If the hierarchical
join query $jq$ is inversion-free, then it belongs to $\mathcal{F}^*$ and it
is its own eraser (i.e. $E = \{jq\}$ satisfies both conditions
above). We can use it to separate $T_i$ from $T_j$. In particular,
if $q$ is inversion-free, then any hierarchical join query has an
eraser, and all sets can be separated. But if $jq$ has an inversion,
then $jq$ does not belong to $\mathcal{F}^*$ and we must find some
different query (or queries) in $\mathcal{F}^*$ that can be used to separate
$T_i$ from $T_j$.

Example 2.23 Let's revisit the query in Example 2.14. We
had $q = P(x), R(x,y), R(x',y'), S(x')$. Suppose we start
from the trivial coverage $\mathcal{C}_0 = \{q\}$, with two factors $\mathcal{F}_0 = \{f_1, f_2\}$
(see notations in Example 2.14), and suppose we
chose a single expansion variable for $f_1$ and $f_2$, namely $x_1, x_2$
respectively. Its hierarchical closure adds the join query $f_3$
between $f_1$ and $f_2$. The coverage $\mathcal{C}_0^*$ contains the following
covers: $\{f_1, f_2\}$, $\{f_3\}$, $\{f_1, f_3\}$, $\{f_2, f_3\}$ and $\{f_1, f_2, f_3\}$.
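The list of covers in Example 2.23 follows directly from the set definition of $C^*$ in Definition 2.19. A minimal sketch, assuming that $Factors(f_3) = \{f_1, f_2\}$ (as given by the inductive construction for a join of $f_1$ and $f_2$) and using a hypothetical helper `closure_covers`:

```python
from itertools import combinations

def closure_covers(factors_of, covers):
    """Non-empty subsets c of F* whose union of Factors(f) is a cover in C."""
    names = sorted(factors_of)
    cover_set = {frozenset(c) for c in covers}
    result = []
    for size in range(1, len(names) + 1):
        for c in combinations(names, size):
            union = frozenset().union(*(factors_of[f] for f in c))
            if union in cover_set:
                result.append(set(c))
    return result

# Factors(.) for the three queries in F* of Example 2.23.
factors_of = {"f1": {"f1"}, "f2": {"f2"}, "f3": {"f1", "f2"}}
covers = [{"f1", "f2"}]  # the single cover of the trivial coverage

closed = closure_covers(factors_of, covers)
for c in closed:
    print(sorted(c))
```

The enumeration is exponential in the number of factors, which is acceptable here because the number of factors depends only on the query, not on the data.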

Thus, we have expressed the probability of a query $p(q)$
using the sum in Theorem 2.22. This is still exponential in
size, and now we will show how to compute a closed form for
that sum. Here we will use different techniques for the two
PTIME algorithms. In the first algorithm (for inversion-free
queries) the coverage is unary, and all independence predicates
are of the form $T_i \cap T_j = \emptyset$: here we derive closed
forms directly. In the second algorithm (for queries with
erasable inversions) the independence predicates are more
complex: in this case we will reduce the sum to the probability
of an inversion-free query $\phi$ over the $\bar{T}$ vocabulary,
thus bootstrapping the first PTIME algorithm.

3. PTIME ALGORITHMS 

In this section, we establish one half of the dichotomy
by proving Theorem 1.8 (2). We start by computing simple
sums over functions on sets, then use them to give a PTIME
algorithm for queries without inversions, and finally give the
general PTIME algorithm for queries that have erasers for
all inversions.

3.1 Simple Sums 

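Let $A = \{1, \ldots, N\}$, $\bar{g} = (g_1, \ldots, g_k)$ be $k$ functions $g_i : A \to \mathbb{R}$,
$i = 1, \ldots, k$, and $\bar{T} = (T_1, \ldots, T_k)$ a $k$-tuple of subsets
of $A$. Denote $\bar{g}(\bar{T}) = g_1(T_1) \cdots g_k(T_k)$, where $g_i(T_i) = \prod_{a \in T_i} g_i(a)$.
$\emptyset \neq \bar{T}$ abbreviates $\emptyset \neq T_1, \ldots, \emptyset \neq T_k$. Let $\phi$
be a conjunction of statements of the form $T_i \cap T_j = \emptyset$ or $T_i \subseteq T_j$,
and define:

$$S_\phi = \{\sigma \mid \sigma \subseteq [k], \forall i,j \in \sigma,\ \phi \not\models T_i \cap T_j = \emptyset\} \cap \{\sigma \mid \sigma \subseteq [k], \forall i \in \sigma, j \notin \sigma,\ \phi \not\models T_i \subseteq T_j\}$$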



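Definition 3.1. Denote the following sums:

$$\oplus_\phi \bar{g} = \sum_{\bar{T} \subseteq A,\ \phi} \bar{g}(\bar{T})$$

$$\oplus^+_\phi \bar{g} = \sum_{\emptyset \neq \bar{T} \subseteq A,\ \phi} \bar{g}(\bar{T})$$

For $\sigma \subseteq [k]$, denote $\bar{g}_\sigma$ the family of functions $(g_i)_{i \in \sigma}$.

Proposition 3.2. The following closed forms hold:

$$\oplus_\phi \bar{g} = \prod_{a \in A} \sum_{\sigma \in S_\phi} \bar{g}_\sigma(a)$$

$$\oplus^+_\phi \bar{g} = \sum_{\sigma \subseteq [k]} (-1)^{k - |\sigma|} \oplus_\phi \bar{g}_\sigma$$

Moreover, the expressions above have sizes $O(k 2^k N)$ and
$O(k 2^{2k} N)$ respectively, hence all have an expression size that
is linear in $N$.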

Example 3.3 Consider four functions $g_i : A \to \mathbb{R}$, $i = 1, 2, 3, 4$,
and suppose we want to compute the following sum:

$$\sum_{T_1 \cap T_2 = \emptyset,\ T_2 \cap T_3 = \emptyset,\ T_4 \subseteq T_2} g_1(T_1) g_2(T_2) g_3(T_3) g_4(T_4)$$

In our notation, this is $\oplus_\phi \bar{g}$, where $\phi$ is $T_1 \cap T_2 = \emptyset \wedge T_2 \cap T_3 = \emptyset \wedge T_4 \subseteq T_2$.
The set $S_\phi$ is $\{\emptyset, \{1\}, \{2\}, \{2,4\}, \{3\}, \{1,3\}\}$.
Thus, the expression for the sum is

$$\prod_{a \in A} \bigl(1 + g_1(a) + g_2(a) + g_2(a) g_4(a) + g_3(a) + g_1(a) g_3(a)\bigr)$$

The size of this expression is $8N$, where $N$ is the size of $A$.
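The closed form of Example 3.3 can be compared against direct enumeration on a tiny domain. The following sketch uses arbitrary integer weights chosen for illustration (they are not from the paper) and hypothetical function names; the brute-force sum ranges over all $2^{4N}$ tuples of subsets, while the closed form is a product of $N$ small factors.

```python
from itertools import product
from math import prod

N = 3
A = range(N)
# Arbitrary illustrative weights g[i][a] for i = 1..4 (index 0 is unused).
g = [None, [2, -1, 3], [1, 4, -2], [5, 2, 1], [-3, 1, 2]]

def subsets(universe):
    """Enumerate all subsets of `universe` as frozensets."""
    items = list(universe)
    for bits in product([0, 1], repeat=len(items)):
        yield frozenset(x for x, b in zip(items, bits) if b)

def weight(i, T):
    """g_i(T) = product of g_i(a) over a in T (empty product is 1)."""
    return prod(g[i][a] for a in T)

def brute_force():
    """Sum over all (T1, T2, T3, T4) with T1&T2 = T2&T3 = {} and T4 <= T2."""
    total = 0
    all_subsets = list(subsets(A))
    for T1, T2, T3, T4 in product(all_subsets, repeat=4):
        if not (T1 & T2) and not (T2 & T3) and T4 <= T2:
            total += weight(1, T1) * weight(2, T2) * weight(3, T3) * weight(4, T4)
    return total

def closed_form():
    """One factor per element a, one term per membership pattern in S_phi."""
    return prod(
        1 + g[1][a] + g[2][a] + g[2][a] * g[4][a] + g[3][a] + g[1][a] * g[3][a]
        for a in A
    )

print(brute_force() == closed_form())
```

The six terms in each factor correspond to the six allowed ways a single element can belong to the sets $T_1, \ldots, T_4$, which is exactly the set $S_\phi$ listed in the example.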

3.2 PTIME for Inversion-Free Queries 

Let $q$ be an inversion-free query. We now give a PTIME
algorithm for computing $p(q)$ on a probabilistic structure.

Theorem 3.4. If $q$ has no inversions then $q$ has a unary
coverage.

This says that we can choose for each factor $f$ a single root
variable $r_f$ s.t. any MGU between two (not necessarily distinct)
factors $f, f'$ maps $r_f$ to $r_{f'}$: the proof in the Appendix
uses the canonical coverage $\mathcal{C}^<$, considers for each factor $f$
all maximal variables under $\sqsupseteq$ and chooses as root variable
the maximum variable under $>$. Note that for queries with
inversions Theorem 3.4 fails (recall the queries $H_k$).

Example 3.5 We illustrate Theorem 3.4 on two queries.

$q_1 = R(x,y), S(x,y), S(x',y'), T(y')$

$q_2 = R(x,y), R(y,x)$

In the trivial coverage $\mathcal{C} = \{q_1\}$ for $q_1$ the factors are

$f_1 = R(x,y), S(x,y) \qquad f_2 = S(x',y'), T(y')$

We see that $r_{f_1} = \{y\}$ and $r_{f_2} = \{y'\}$ satisfy the properties
of Theorem 3.4 (there are two maximal variables for $f_1$, but
we have to pick $y$ because it unifies with $y'$). For $q_2$, the
trivial coverage $\mathcal{C} = \{q_2\}$ does not work since there is a
unifier that unifies $x$ with $y$, and exactly one of them can
be the expansion variable. On the other hand, consider the
following coverage:

$f_1 = R(x_1,y_1), R(y_1,x_1), x_1 > y_1 \qquad f_2 = R(x,x)$

Now we can set $r_{f_1} = x_1$ and $r_{f_2} = x$. □
Now, let $q$ be a query without inversion and $\mathcal{C} = (\mathcal{F}, C, \bar{x})$
be any unary coverage. Let $\mathcal{C}^* = (\mathcal{F}^*, C^*, \bar{x}^*)$ be the hierarchical
closure of $\mathcal{C}$. Theorem 2.22 applied to this unary
coverage gives:

$$p(q) = \sum_{\bar{T} \mid \bar{T} \models ip^n(\mathcal{C}^*)} N(\mathcal{C}^*, sig(\bar{T})) (-1)^{|\bar{T}|} p(\mathcal{F}^*(\bar{T}))$$

All the sets in $\bar{T}$ have arity 1, since $\mathcal{C}^*$ is also unary, hence
each join predicate has the form $T_i(x), T_j(x)$, which is equivalent
to $T_i \cap T_j \neq \emptyset$, hence $ip(\mathcal{C}^*)$ is a conjunction of predicates
of the form $T_i \cap T_j = \emptyset$.

So far we have only added the independence predicates 
ip n (C*), i.e. independence predicates between those pairs 
hi and hj for which the join query is not equivalent to either 
hi and hj. Next, we add independence predicates between 
the remaining pairs. We generalize our technique of Exam- 
ple [2J2] We replace T with T , where T' contains all the 
sets in T along with some additional sets. For each hi,hj 
such that their hierarchical join is equivalent to hj , T' con- 
tains an additional set T,j. Denote ip ! (C*) the conjunction 
of the following predicates 

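• A predicate $T_{i,j} \subseteq T_j$ for all $T_{i,j}, T_j$ in $\bar{T}'$

• A predicate $T_{i_1,j} \cap T_{i_2,j} = \emptyset$ for all $T_{i_1,j}, T_{i_2,j}$ in $\bar{T}'$
such that there is a predicate $T_{i_1} \cap T_{i_2} = \emptyset$ in $ip(\mathcal{C}^*)$.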

Let ip'(C*) denote the conjunction of ip(C*) and ip'(C*). 
Then, we obtain p(q) — 

N(C*,sig(f))(-lfl J] II P(/( a )) 

T|T|=ip'(C*)Aip ! (C*) feT*a£T f 

Corresponding to each $T_i \in \bar{T}'$, let $g_i : A \to \mathbb{R}$ denote the
function $g_i(a) = -p(f_i(a))$. Also, corresponding to each
$T_{i,j} \in \bar{T}'$, let $g_{i,j}$ denote the function $g_{i,j}(a) = 1$.

Theorem 3.6. Let q be inversion-free. 

1. The probability of q is given by 

P(q) = E iV ( C *^)0 + ^, A ! rr^^ (9) 

where © + ranges over all sets of the form T' . 

2. For each $f \in \mathcal{F}^*$, $f(a)$ is an inversion-free query.

We use Proposition 3.2 to write a closed-form expression
for Equation (9) in terms of the probabilities $p(f(a))$
for $f \in \mathcal{F}^*$. Since each of these queries is inversion-free, we
recursively apply Equation (9) to compute their probabilities.
For any query $q$, let $V(q)$ denote the maximum number
of distinct variables in any single sub-goal of $q$. Clearly, for
any factor $f$, $V(f(a)) < V(f) \leq V(q)$ (since $a$ substitutes a
variable in every sub-goal). Thus, the depth of the recursion
is bounded by $V(q)$.

Corollary 3.7. If $q$ is an inversion-free query, then $p(q)$
can be expressed as a formula of size $O(N^{V(q)})$, where $N$ is
the size of the domain. In particular $q$ is in PTIME.

Example 3.8 Continuing our running example from Example 2.14,
recall that $p(q)$ is given by Equation (8). Let
$\bar{T}' = (T_1, T_2, T_3, T_{1,3}, T_{2,3})$. Denoting $g_i(a) = -p(f_i(a))$
for $i = 1, 2, 3$, $\phi \equiv (T_1 \cap T_2 = \emptyset)$ and $\psi = (T_1 \cap T_2 = \emptyset) \wedge (T_1 \cap T_3 = \emptyset) \wedge (T_2 \cap T_3 = \emptyset) \wedge (T_{1,3} \cap T_{2,3} = \emptyset) \wedge (T_{1,3} \subseteq T_3) \wedge (T_{2,3} \subseteq T_3)$:

$$p(q) = \oplus^+_\phi(g_1, g_2) + \oplus^+(g_3) + \oplus^+_\psi(g_1, g_2, g_3)$$

Now apply Prop. 3.2 to each expression, e.g. $\oplus^+_\psi(g_1, g_2, g_3) = \oplus_\psi(g_1, g_2, g_3) - \oplus_\psi(g_1, g_2) - \cdots$
Each sum in turn has a
closed form. Furthermore, each $f_i(a)$ is a query with a single
variable ($y$ or $y'$), hence each $g_i(a) = -p(f_i(a))$ can be
computed inductively.

Appendix A gives examples of inversion-free queries, showing
several subtleties that were left out of the text.

Queries with Negated Subgoals The PTIME algorithm
in this section can be extended to queries with negated
sub-goals.

Definition 3.9. A conjunctive query with negations is a
query $q = \exists \bar{x}.(\varphi_1 \wedge \ldots \wedge \varphi_k)$, where each $\varphi_i$ is either a positive
sub-goal $R(t)$, or a negative sub-goal $not(R(t))$, or an
arithmetic predicate. The query $q$ is said to be inversion-free
if the conjunctive query obtained by replacing each $not(R(t))$
sub-goal with the sub-goal $R(t)$ is inversion-free.

Definition 3.10. (Inversion-free property) A property $\phi$
is called an inversion-free property if it can be expressed as
a Boolean combination of queries $\{q_1, \ldots, q_m\}$ such that
each $q_i$ is a conjunctive query with negation and the query
$q_1 q_2 \cdots q_m$ is inversion-free.

Theorem 3.11. Let $\phi$ be any inversion-free property. Then,
computing $p(\phi)$ is in PTIME.

Proof. (Sketch) Consider a single inversion-free conjunctive
query with negation. The same recurrence formula as in
Theorem 3.6 applies; the only difference is that during recursion
we will reach negated constant sub-goals: $p(not(R(a,b,c)))$
is simply $1 - p(R(a,b,c))$. For any general $\phi$, use the
inclusion/exclusion formula to reduce it to conjunctive queries
with negations, each of which is inversion-free. □

3.3 Complex Sums 

In Section 3.2 we used simple sums to give a PTIME algorithm
for inversion-free queries. Here, we show that the
PTIME algorithm can be used to compute closed formulas
for complex sums. We call this the bootstrapping technique.

Bootstrapping: Let $\bar{g} = (g_1, \ldots, g_k)$ be a family of functions,
$g_i : A^{n_i} \to \mathbb{R}$, where the arity of $g_i$ is $n_i$. We want
to compute sums of the form $sum = \sum_{\bar{S} \models \phi} \bar{g}(\bar{S})$, where $\phi$
is a complex predicate. We cannot use the summations
of Section 3.1, which only apply when the $g_i$ are unary. Instead,
we use a bootstrapping technique to reduce this problem
back to evaluating an inversion-free query on a probabilistic
database, and use the PTIME algorithm of Section 3.2.
The basic principle is that we can reduce the problem
to the evaluation of $\phi$ over a probabilistic database. Create
a probabilistic instance of $S$, where, assuming $k = 1$
for simplicity, for each tuple $\bar{a} \in S$, we set its probability to
$p(\bar{a}) = g(\bar{a})/(1 + g(\bar{a}))$. Then, the probability of $\phi$ over
this instance is

$$p(\phi) = \sum_{S \models \phi} \prod_{\bar{a} \in S} p(\bar{a}) \prod_{\bar{a} \notin S} (1 - p(\bar{a})) = \prod_{\bar{a}} \frac{1}{1 + g(\bar{a})} \sum_{S \models \phi} g(S) = \prod_{\bar{a}} \frac{1}{1 + g(\bar{a})} \cdot sum.$$

Thus, we can compute $sum$ in PTIME if we can evaluate the query
$\phi$ in PTIME.
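The bootstrapping identity can be checked by brute force on a small instance. The sketch below assumes a single binary relation, uses exact rational arithmetic, and picks an arbitrary predicate `phi` purely for illustration (it is not one of the paper's independence predicates); the identity itself holds for any predicate as long as no weight equals $-1$.

```python
from fractions import Fraction
from itertools import combinations
from math import prod

def subsets(items):
    """Enumerate all subsets of `items` as frozensets."""
    items = list(items)
    for r in range(len(items) + 1):
        for c in combinations(items, r):
            yield frozenset(c)

def weighted_sum(g, phi):
    """sum = sum over S satisfying phi of prod_{a in S} g(a)."""
    return sum(prod(g[a] for a in S) for S in subsets(g) if phi(S))

def probability(g, phi):
    """p(phi) on the instance where tuple a has probability g(a)/(1+g(a))."""
    p = {a: Fraction(g[a]) / (1 + g[a]) for a in g}
    total = Fraction(0)
    for S in subsets(g):
        if phi(S):
            total += prod(p[a] for a in S) * prod(1 - p[a] for a in g if a not in S)
    return total

def sum_via_bootstrapping(g, phi):
    """Recover the weighted sum from p(phi) by multiplying with prod (1+g(a))."""
    return probability(g, phi) * prod(1 + Fraction(g[a]) for a in g)

# Illustrative weights on four binary tuples (hypothetical data).
g = {(1, 1): Fraction(1, 2), (1, 2): Fraction(3),
     (2, 1): Fraction(2, 5), (2, 2): Fraction(1)}
phi = lambda S: not any(x == y for (x, y) in S)  # hypothetical predicate

print(sum_via_bootstrapping(g, phi) == weighted_sum(g, phi))
```

Here both sides are computed by enumeration; the point of the technique in the text is that the left-hand side can instead be obtained from a polynomial-time query evaluation when $\phi$ is an inversion-free property.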

Theorem 3.12. Let $\phi$ be an inversion-free property. Then
$\sum_{\bar{S} \models \phi} \bar{g}(\bar{S})$ has a closed form polynomial in the domain size.
3.4 The General PTIME Algorithm

Let $q$ be a conjunctive query, let $\mathcal{C} = (\mathcal{F}, C)$ be a
strict coverage for $q$, and let $\mathcal{H}$ be the set of hierarchical
unifiers, as defined in Section 2.6. Suppose the following
holds: for every hierarchical join predicate $jp = \theta^T(h_i, h_j)$
between two factors in $\mathcal{H}$, the join query $jq = \theta^R(f_i, f_j)$
has an eraser. We will show here that $q$ is in PTIME, thus
proving Theorem 1.8 (2).

We set the expansion variables $\bar{x}$ to include all variables,
i.e. $\bar{x}_f = Vars(f)$ for all $f \in \mathcal{F}$. Let $\mathcal{C}^* = (\mathcal{F}^*, C^*, \bar{x}^*)$
be the hierarchical closure of $\mathcal{C}$. By Theorem 2.22 we have
$p(q) = Exp(\mathcal{C}^*)$, where

$$Exp(\mathcal{C}^*) = \sum_{\bar{T} \mid ip^n(\mathcal{C}^*)} N(\mathcal{C}^*, sig(\bar{T})) (-1)^{|\bar{T}|} p(\mathcal{F}^*(\bar{T}))$$

$$= \sum_{\sigma} N(\mathcal{C}^*, \sigma) \sum_{\bar{T} \mid ip^n(\mathcal{C}^*),\ sig(\bar{T}) = \sigma} (-1)^{|\bar{T}|} p(\mathcal{F}^*(\bar{T})) \qquad (10)$$

Before we proceed, we illustrate with an example: 

Example 3.13 Consider the query q in Example [1T7] Al- 
though q has an inversion (between the two S Subgoals) we 
have argued in Sec. 1.1 that it is in PTIME. Importantly,
the third line of constant sub-goals plays a critical role: if
we removed it, the query would become #P-hard.
Consider the coverage $\mathcal{C} = (\mathcal{F}, C, \bar{x})$, where $\mathcal{F}$ is (see footnote 3):

$f_1 = R(r,x), S(r,x,y), U(a,r), U(r,z), V(r,z), r \neq a$

$f_2 = S(r',x',y'), T(r',y'), V(a,r'), r' \neq a$

$f_3 = U(a,z'), V(a,z')$

$f_4 = R(a), S(a,b,c), U(a,a)$

and $C = \{\{f_1, f_3, f_4\}, \{f_2, f_3, f_4\}\}$. We cannot simply take
the root variables $r$, $r'$, and $z'$ as expansion variables and
proceed with the recurrence formula in Th. 3.6 because the
query $f_{12} = f_1(r) f_2(r)$ is #P-hard. We must keep all variables
as expansion variables to avoid the inversion. Thus,
the root unifiers $\mathcal{H}$ are (recall Example 2.17):

$f_{12} = f_1, f_2, r = r'$

$f_{23} = f_2, f_3, r' = z'$

$f_{13} = f_1, f_3, r = z'$

$f_{123} = f_1, f_2, f_3, r = r' = z'$

Out of these, $f_{12}$ and $f_{123}$ have inversions, thus $\mathcal{F}^*(q) = \{f_1, f_2, f_3, f_4, f_{23}, f_{13}\}$.
In the expansion $Exp(\mathcal{C}^*)$, there are
sets $T_1, T_2, T_3, T_4, T_{23}, T_{13}$ but note that they are not unary,
e.g. $T_1$ has arity 4 as $\bar{x}_{f_1} = \{r, x, y, z\}$. The critical question
is how to separate now $T_1$ from $T_2$, since we don't have
the factor $f_{12}$. Here we use the fact that there exists a
homomorphism $f_3 \to f_{12}$, thus $f_3$ is an eraser between $f_1$
and $f_2$ and we will use $f_3$ to separate $T_1, T_2$. The definition
of an eraser (Def. 2.21) requires us to check $\forall \sigma$, $N(\mathcal{C}, \sigma \cup \{f_1, f_2\}) = N(\mathcal{C}, \sigma \cup \{f_1, f_2, f_3\})$.
The only $\sigma$ that makes
both $N$'s non-zero is $\{f_4\}$ (and supersets), and indeed the
two numbers are equal to +1. It is interesting to note that, 
if we delete the last line from q, then we have the same set 
of factors but a new coverage C' = /a}, {/aj /3> /*}}: 

then fs is no longer an eraser because for a — we have 
N{{fx,h}) = 1 and jV({/i,/ 2j / 3 }) = 0. Continuing the 
example, we conclude that, with aid from the eraser, we can 

3 Strictly speaking each constant sub-goal R(a), S(a,b,c), 
U (a, a) should be a distinct factor. 



now insert all independence predicates. We have to keep in
mind, however, that these predicates are no longer simple
disjointness conditions, e.g. the predicate between $T_1$ and $T_2$
is the negation of the query $T_1(r,x,y,z), T_2(r,x',y')$. □

We now focus on each of the inner sums in Equation (10).
We want to reduce it to the evaluation of an inversion-free property,
but there are two problems. First, the predicate $ip^n(\mathcal{C}^*)$
over $\bar{T}$ is not an inversion-free property. Second, we still
need to add the predicates $ip^t(\mathcal{C}^*)$ to make $p(\mathcal{F}^*(\bar{T}))$ multiplicative.
To solve these problems, we apply a preprocessing
step on Equation (10), which we call the change of basis. In
this step, we group the $\bar{T}$ that generate the same $\mathcal{F}^*(\bar{T})$ and
sum over these groups.

Example 3.14 Consider a factor $f = R_1(x,y), R_2(y,z)$.
We look at the set $T(x,y,z)$ corresponding to this factor,
which is a ternary set since $\bar{x}_f = \{x,y,z\}$. For every $T$,
define $S^0 = \pi_y(T)$, $S^1 = \pi_{xy}(T)$ and $S^2 = \pi_{yz}(T)$, hence
$T = S^0 \bowtie S^1 \bowtie S^2$ (natural join). Clearly, $S^0, S^1, S^2$ satisfy
the predicate $S^0 = \pi_y(S^1) = \pi_y(S^2)$. Consider the sum

$$\sum_{T} (-1)^{|T|} p(f(T)) \qquad (11)$$

We group all $T$ that generate the same $S^0, S^1, S^2$ and show
that the summation in Eq. (11) is equivalent to the following:

$$\sum_{S^0, S^1, S^2 \mid S^0 = \pi_y(S^1) = \pi_y(S^2)} (-1)^{|S^1| + |S^2| + |S^0|} p(R_1(S^1) R_2(S^2))$$

Thus, we have changed the basis of summation from $T$ to
$S^0, S^1, S^2$. □
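To make the projections in Example 3.14 concrete, the sketch below computes them for a small ternary set and checks the link predicate. It takes $S^2$ to be the projection on $(y, z)$, which is our reading of the example, and the sample set `T` is hypothetical, chosen so that it coincides with the join of its projections.

```python
def project(T, positions):
    """Project a set of tuples onto the given column positions."""
    return {tuple(t[i] for i in positions) for t in T}

def change_of_basis(T):
    """Split T(x, y, z) into S0 = pi_y(T), S1 = pi_xy(T), S2 = pi_yz(T)."""
    return project(T, [1]), project(T, [0, 1]), project(T, [1, 2])

def link_predicates_hold(S0, S1, S2):
    """Check the link predicate S0 = pi_y(S1) = pi_y(S2)."""
    return S0 == project(S1, [1]) and S0 == project(S2, [0])

def natural_join(S0, S1, S2):
    """Natural join of S0(y), S1(x, y) and S2(y, z) on the shared column y."""
    return {(x, y, z) for (x, y) in S1 for (y2, z) in S2
            if y == y2 and (y,) in S0}

# Hypothetical sample set over columns (x, y, z).
T = {(1, "p", 7), (1, "p", 8), (2, "p", 7), (2, "p", 8), (1, "q", 7)}
S0, S1, S2 = change_of_basis(T)

print(link_predicates_hold(S0, S1, S2))
print(natural_join(S0, S1, S2) == T)
```

The link predicate holds for the projections of any `T` by construction; it is what the new summation basis must enforce explicitly once `T` itself is no longer part of the sum.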

The change of basis introduces some new predicates between
sets, which we call the link predicates, e.g. predicates of the
form $S^0 = \pi_y(S^1)$. But at the same time, as we shall see,
the change of basis simplifies the independence predicates
$ip(\mathcal{C}^*)$, making them inversion-free, so that the computation
of Equation (10) can be reduced to the evaluation of inversion-free
queries. We now formally define the change of basis.
This consists of the following steps: (1) we change the summation
basis from $\bar{T}$ to $\bar{S}$; (2) we translate the $ip^n(\mathcal{C}^*)$ predicates
from $\bar{T}$ to $\bar{S}$; (3) we introduce a new set of predicates,
called the link predicates, on $\bar{S}$; (4) we add the remaining
independence predicates, $ip^t(\mathcal{C}^*)$, translated from $\bar{T}$ to $\bar{S}$,
to $\bar{S}$.

Consider a factor $f \in \mathcal{F}^*$. It is a connected hierarchical
query with the hierarchy relation $\sqsubseteq$ on $Vars(f)$. Given $x \in Vars(f)$,
let $[x]$ denote its equivalence class under $\sqsubseteq$ and
let $\lceil x \rceil$ denote $\{y \mid y \sqsupseteq x\}$. Define a hierarchy tree for $f$
as the tree where nodes are equivalence classes of variables,
and edges are such that their transitive closure is $\sqsubseteq$. For
instance, in Example 3.14 the hierarchy tree of $f$ has nodes
$\{x\}, \{y\}, \{z\}$ with $\{y\}$ as root and $\{x\}, \{z\}$ its children,
since $y$ occurs in both sub-goals.

Define a new vocabulary, consisting of a relation $S_f^{[x]}$ for
each $f \in \mathcal{F}^*$ and each node $[x]$ in the hierarchy tree of $f$,
with arity equal to the size of $\lceil x \rceil$. Let $\bar{S}$ denote instances of
this vocabulary. The intuition is that $S_f^{[x]}$ denotes $\pi_{\lceil x \rceil}(T_f)$
in the change of basis from $\bar{T}$ to $\bar{S}$. This completes step 1.

Let $ip^n$ denote the set of independence predicates on $\bar{S}$,
translated in a straightforward manner from the independence
predicates $ip^n(\mathcal{C}^*)$ on $\bar{T}$ (details in appendix). This
is step 2.

Define a link predicate $S_f^{[x]} = \pi_{\lceil x \rceil}(S_f^{[y]})$ for every edge
$([x], [y])$ in the hierarchy tree of $f$. Let $lp$ be the set of all
link predicates. This is step 3.

Finally, we add the trivial independence predicates $ip^t$.
For this, we expand the basis of summation from $\bar{S}$ to $\bar{S}'$ by
adding the following sets. We add a new set $S_{f_i,f_j}^{[x_i],[x_j]}$ corresponding
to each pair $S_{f_i}^{[x_i]}, S_{f_j}^{[x_j]}$ such that (i) $f_i$ and $f_j$ have
a hierarchical join query which is equivalent to $f_j$ and (ii)
there are sub-goals $g_i$ in $h_i$ and $g_j$ in $h_j$ referring to the same
relation such that $Vars(g_i) = \lceil x_i \rceil$ and $Vars(g_j) = \lceil x_j \rceil$.
For each such S%f lX . , ip* contains the following conjuncts: 

s h i] n s f sl = > s *i*i - sl f 3 j] - This describes the ste p 4 - 

Finally, we put it all together. We define a function $G(\bar{S}')$
on $\bar{S}'$ as follows. Consider a relation $S_f^{[x]}$, and let $p$ be the
number of children of $[x]$ in the hierarchy tree. For a tuple
$t$ in $S_f^{[x]}$, let

$$G(t) = (-1)^{p+1} \prod_{g \in sg(f) \mid Vars(g) = \lceil x \rceil} p(g(t))$$

Define $G(\bar{S}') = \prod_{t \in \bar{S}'} G(t)$.

Denote $sig(\bar{S}')$ the set $\{f \mid S_f^{[r_f]} \neq \emptyset\}$, where $[r_f]$ denotes
the root of the hierarchy tree of $f$.

Theorem 3.15. With $ip^t$, $ip^n$, $lp$, $sig$ and $G$ as defined
above,

$$\sum_{\bar{T} \mid ip^n(\mathcal{C}^*),\ sig(\bar{T}) = \sigma} (-1)^{|\bar{T}|} p(\mathcal{F}^*(\bar{T})) = \sum_{\bar{S}' \mid ip^n, ip^t, lp,\ sig(\bar{S}') = \sigma} G(\bar{S}')$$

Finally, we use the bootstrapping principle to reduce the
problem of computing the summation to the evaluation of
the query $\phi = (ip^n \wedge ip^t \wedge lp \wedge sig(\bar{S}') = \sigma)$.

Lemma 3.16. The query $\phi$ defined above is an inversion-free
property.

By using Theorem 3.12 we get the following:

Theorem 3.17. Suppose for every hierarchical join predicate
$jp = \theta^T(h_i, h_j)$ between two factors in $\mathcal{H}$, the join
query $jq = \theta^R(f_i, f_j)$ has an eraser. Then, $q$ is in PTIME.

4. #P-HARD QUERIES 

Here we show the other half of Theorem 1.8, i.e., if $q$ has
an inversion without an eraser, then $q$ is #P-hard.

Let $\mathcal{C} = (\mathcal{F}, C, \bar{x})$ be any strict coverage for $q$, $\mathcal{C}^* = (\mathcal{F}^*, C^*, \bar{x}^*)$
its closure and $\mathcal{H}$ the set of hierarchical join
queries over $\mathcal{F}$.

Suppose there are factors $h, h' \in \mathcal{H}$ such that the join
query $hj = \theta^T(h, h')$ has an inversion, but not an eraser.
Among all such $hj$, we will pick a specific one and use it to
show that $q$ is #P-hard. Note that if there is no such $hj$,
then the query is in PTIME by Theorem 3.17.

Let the inversion in $hj$ consist of a unification path of
length $k$ from $(f, x, y)$ with $x \sqsubset y$ to $(f', x', y')$ with $x' \sqsupset y'$.
Then, we will prove the #P-hardness of $q$ using a reduction
from the chain query $H_k$, which is #P-hard by Theorem 1.5.

Given an instance of $H_k$, we create an instance of $q$. The
basic idea is as follows: take the unification path in $hj$ that
has the inversion and completely unify it. We get a non-hierarchical
query (due to the inversion) with two distinguished
variables $x$ and $y$ (the inversion variables), $k + 2$
distinguished sub-goals (that participated in the inversion),
plus other sub-goals in the factor. Use the structure of this
query and the contents of the $k + 2$ relations in the instance
of $H_k$ to create an instance for $q$. We skip the formal description
of the reduction, and instead illustrate it on examples.

Example 4.1 Consider $q = U(x), V(x,y), V(y,x)$ and the
coverage $\mathcal{C} = (\mathcal{F}, C)$ where $\mathcal{F} = \{f\}$ with $f = U(x), V(x,y), V(y,x), x \neq y$
and $C = \{\{f\}\}$. The coverage has a single
factor and a single cover. The first $V$ sub-goal of factor $f$
unifies with the second $V$ sub-goal of another copy of $f$ to give
an inversion between $x \sqsupset y$ and their copy $y' \sqsubset x'$. If we
unify the two sub-goals in two copies of $f$, we get the query:

$q_u = \underline{U(x)}, \underline{V(x,y)}, V(y,x), \underline{U(y)}$

We have underlined the sub-goals taking part in the inversion.
Now we give a reduction from the query $H_0 = R(x), S(x,y), S(x',y'), T(y')$.
Given any instance of $R, S, T$ for $H_0$,
construct an instance of $U, V$ as follows. We map the $R, S, T$
relations in $H_0$ to the $U, V, U$ underlined sub-goals of $q_u$ as
follows: for each tuple $R(a)$, create a tuple $U(a)$ with the same
probability. For each $S(a,b)$, create $V(a,b)$ with the same
probability. For each $T(a)$, create $U(a)$ with the same probability.
Also, for each $S(a,b)$, create $V(b,a)$ with probability 1
(this corresponds to the non-underlined sub-goal).

There is a natural 1-1 correspondence between the substructures
of $U, V$ and the substructures of $R, S, T$ with the
same probability. It can be shown that $q$ is true on a substructure
iff the query $R(x), S(x,y) \vee S(x',y'), T(y')$ is true
on the corresponding substructure. Thus, we can compute
the probability of the query $R(x), S(x,y) \vee S(x',y'), T(y')$,
and hence the probability of $H_0$, by applying inclusion-exclusion.
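The reduction of Example 4.1 can be replayed on a toy instance. The sketch below is illustrative and makes two assumptions that are not spelled out in the text: the values used in the first and second column of $S$ are disjoint (so that the $U$ tuples created from $R$ and from $T$ never collide), and probabilities are computed by naive enumeration of worlds. All function names and the sample data are hypothetical.

```python
from fractions import Fraction
from itertools import product

def worlds(prob):
    """Enumerate (world, probability) pairs of a tuple-independent instance."""
    tuples = list(prob)
    for bits in product([False, True], repeat=len(tuples)):
        weight = Fraction(1)
        for t, present in zip(tuples, bits):
            weight *= prob[t] if present else 1 - prob[t]
        if weight:
            yield {t for t, present in zip(tuples, bits) if present}, weight

def probability(prob, query):
    """Total probability of the worlds on which `query` is true."""
    return sum((w for world, w in worlds(prob) if query(world)), Fraction(0))

def reduce_h0_to_q(prob_rst):
    """Build the U, V instance of Example 4.1 from an R, S, T instance."""
    prob_uv = {}
    for (rel, args), p in prob_rst.items():
        if rel in ("R", "T"):
            prob_uv[("U", args)] = p
        else:  # rel == "S"
            a, b = args
            prob_uv[("V", (a, b))] = p
            prob_uv[("V", (b, a))] = Fraction(1)
    return prob_uv

def q(world):   # q = U(x), V(x, y), V(y, x)
    return any(rel == "V" and ("U", (args[0],)) in world
               and ("V", args[::-1]) in world for rel, args in world)

def q1(world):  # R(x), S(x, y)
    return any(rel == "S" and ("R", (args[0],)) in world for rel, args in world)

def q2(world):  # S(x', y'), T(y')
    return any(rel == "S" and ("T", (args[1],)) in world for rel, args in world)

half, third = Fraction(1, 2), Fraction(1, 3)
rst = {("R", (1,)): half, ("R", (2,)): third,
       ("S", (1, "u")): half, ("S", (2, "u")): third, ("S", (2, "v")): half,
       ("T", ("u",)): third, ("T", ("v",)): half}

p_q = probability(reduce_h0_to_q(rst), q)
p_h0 = probability(rst, q1) + probability(rst, q2) - p_q  # inclusion-exclusion
print(p_q == probability(rst, lambda w: q1(w) or q2(w)))
print(p_h0 == probability(rst, lambda w: q1(w) and q2(w)))
```

The last two lines mirror the argument in the text: an oracle for $p(q)$ yields the probability of the disjunction, from which $p(H_0)$ follows by inclusion-exclusion because the two disjuncts are individually easy to evaluate.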

Next, we show why a hardness reduction fails if the inver- 
sion has an eraser. 

Example 4.2 We revisit the query $q$ in Example 3.13. There
is an inversion between $x \sqsubset y$ in $f_1$ and $x' \sqsupset y'$ in $f_2$. However,
their hierarchical join, $f_{12}$, has an eraser. The unified
query consists of $q_u =$

$R(r,x), S(r,x,y), U(a,r), U(r,z), V(r,z), V(a,r), T(r,x)$
$R(a), S(a,b,c), U(a,a)$

We construct an instance $RSTUV$ for $q$ from an instance
$R'S'T'$ for $H_0$ as in the previous example. However, there is
a bad mapping from $q$ to $q_u$, corresponding to the eraser,
which is $\{r \to a, x \to b, y \to c, x' \to x, y' \to y, z \to r\}$,
which avoids the $R$ sub-goal. The effect is that $q$ is true on
a world iff the query $S'(x',y') T'(y')$ (rather than $H_0$) is true
on the corresponding world. So the reduction from $H_0$ fails.
In fact, we know that this query $q$ is in PTIME.

The final example shows that if there are multiple inver- 
sions without erasers, we need to pick one carefully, which 
makes the hardness reduction challenging. 

Example 4.3 Consider the following variation of the query
in the previous example:

$q = R(x), S(x,y), U(x,y,a,b), U(z_1,z_2,x,y), V(z_1,z_2,x,y)$
$\quad\ S(x',y'), T(y'), V(x',y',a,b)$
$\quad\ R(a), S(a,b), U(a,b,a,b)$

Let $f_1$ and $f_2$ denote the factors corresponding to the first
two lines of $q$. There is an inversion from $x \sqsupset y$ in $f_1$ to
$x' \sqsubset y'$ in $f_2$ via the two $S$ sub-goals, and it does not have
an eraser. But if we unify the two $S$ sub-goals to obtain $q_u$,
there is a "bad mapping" from $q$ to $q_u$ that maps $x, y$ to $a, b$
and $z_1, z_2$ to $x, y$. However, as it turns out, there is another
inversion in $q$ that we can use for hardness. The inversion is
from $x \sqsupset y$ to $z_1 \equiv z_2$ to $x', y'$ through the following unification
path: $U(x,y,x,y)$ unifies with (a copy of) $U(z_1,z_2,x,y)$
and $V(z_1,z_2,x,y)$ unifies with $V(x',y',a,b)$. We can show
that this inversion works for the hardness reduction.

By formalizing these ideas, we prove: 

Theorem 4.4. Suppose there are $h, h' \in \mathcal{H}(q)$ such that
their hierarchical join $hj$ has an inversion without an eraser.
Then, $q$ is #P-complete.

5. CONCLUSIONS 

We show that every conjunctive query has either PTIME
or #P-complete complexity on a probabilistic structure. As
part of the analysis required to establish this result we have
introduced new notions such as hierarchical queries, inversions,
and erasers. Future work may include several research
directions: a study of whether the hardness results can
be sharpened to counting the number of substructures (i.e.
when all probabilities are 1/2); an analysis of the query
complexity; extensions to richer probabilistic models (e.g.
to probabilistic databases with disjoint and independent tuples [9]);
and, finally, studies for making our PTIME algorithm
practical for probabilistic database systems.
APPENDIX



A. EXAMPLES OF INVERSIONS 

We illustrate in Fig. 1 several subtleties of inversion-free queries that were left out of the text. Fig. 2 illustrates some
queries with inversions; all are #P-hard.



Query. The trivial coverage 

is non-strict and has an "inversion" 


Fragment of a strict coverage 
(Unification chain underlined) 


Comments 


R{x)S\(x,y,y) 

Si (u, v, w), S 2 {u, v, w) 

St{xL,x' ,i/),T(y') 


qa = R(x), Si(x, y,y),x ^ y, 

Si (u, v, v), S 2 (u, v,v),u ^ v 
S 2 (x',x',y'),T(y'),x' ^y' 

qc 2 = R(x), Si(x,y, y),x ^ y 

Si (u, u, w), S 2 (u, u, w), u ^ to 
S 2 (sl,x',£),T(y'),x' ^y' 


Illustrates the need for a strict coverage.
The unification path forming
an inversion in $q$ in the trivial
cover (which is non-strict) is interrupted
when we add $\neq$ predicates to
make the cover strict.


R(x 1 ,x 2 ),S(xj_,x 2 ,y,y), 

S(xi,xi,x 2 ,x 2 ) 

S(xL,x',y^,y'),T{y') 


qc = R(x, x), S(x, x, y, y), 
S(x, x, x, x),x ^ y 

at f f 1 l\ rr^l l\ III 

S(x_,x ,y_,y ),T(y ),x ^ y 
= R(x, x), S(x, x, x, x), 

S(x',x',y',y'),T(y'),x' ?y' 


This illustrates the need to minimize 
covers. The inversion disappears after 
minimizing qc. 


R(x 1 ,x 2 ),S(xj_,x 2 ,y, y) 
S(xi,x 2 ,xi,x 2 ) 


qci = R(x,x),S(x,x,y,y),x y 

S(x^,x',y[,y'),T(y\y'),x' j^yi 

S(x, x, x, x) 
qc 2 = R(x, x), S(x, x, x, x), 

S(x',x',y',y'),T(y',y'),x' ^y' 


This shows that we should not consider
redundant coverages. There is an inversion
in $qc_1$, but this cover is contained
in $qc_2$ so it is redundant and
after we remove $qc_1$ from the coverage
there is no more inversion.



Figure 1: Inversion-free queries: all are in PTIME. 



B. PROOF OF THEOREM 1.4 

Let $P$ be a conjunctive formula and $A$ be a structure. We say that $P$ is decisive w.r.t. $A$ if there exists a function
$c : A \to Var(P)$ s.t. for any homomorphism $h : P \to A$ there exists an automorphism $i : P \to P$ s.t. denoting $h' = h \circ i$ we
have $c \circ h' = id_P$. The function $c$, which we call a choice function, "chooses" for each node $u$ in $A$ a variable $x = c(u)$ in $P$
such that any homomorphism from $P$ to $A$ maps $x$ to $u$, up to renaming of variables in $P$. Let $\mathcal{S}$ be a class of structures. We
say that $P$ is decisive w.r.t. $\mathcal{S}$ if it is decisive w.r.t. each structure in $\mathcal{S}$.

In the sequel we will make use of the following two classes of graphs. A 4-partite graph has nodes partitioned into four
classes $V_i$, $i = 1, 2, 3, 4$, and edges are subsets of $\bigcup_{i=1,3} V_i \times V_{i+1}$. A triangled graph has a distinguished node $v_0$ and two disjoint
sets of nodes $V_1, V_2$ s.t. edges are subsets of $(\{v_0\} \times V_1) \cup (V_1 \times V_2) \cup (V_2 \times \{v_0\})$.

Example B.l The query below checks if a graph has a chain of length 3: 

P_3 = E(x,y), E(y,z), E(z,u) 

Then $P_3$ is decisive on the set of 4-partite graphs. To see this, the choice function simply chooses to map $V_1$ to $x$, $V_2$ to $y$, $V_3$
to $z$ and $V_4$ to $u$.

Example B.2 The query below checks if the graph has a triangle: 
T = E(x,y) , E(y,z) , E(z,x) 

Then $T$ is decisive on the class of triangled graphs. To see this, consider a triangled graph $G$ and define $c$ to map $v_0$ to $x$,
$V_1$ to $y$ and $V_2$ to $z$. A homomorphism $h : T \to G$ may map $x$ to some other node than $v_0$, but after a proper rotation
(automorphism) we transform $h$ into a homomorphism $h \circ i$ that is consistent with $c$.

Note that $T$ is not decisive on the class of all graphs. For example, it is not decisive on the complete graph $K_4$.

Our interest in the two queries above and their associated classes of decisive structures comes from the fact that their 
complexity is #P-complete: 
Query


Fragment of a strict coverage (inversion underlined) 


Comments 


R(x,y),R(y, z) 


qc = R(x,y), R(y,z) 
qc = R(xi,i/),R(y' ,z') 


Here and the inversion is between y □ 
z and x' C 1/ in a copy of itself. 




qci 


= R(x),Si(x,y),x > y, 




R(x) Si (x v) 
Si(ui,vi), S2(ui,vi) 

S2(U2,V2), S2(V2,U2) 


qC2 


Si (m, vi) , S2 («i , Vi),ui > "1 

S2(U2,V2), S2(V2,U2),U2 > V 2 
— n,yx ) , 01 ^x, y ), x ^ t/, 

Si(ui,£l), S2(ui,2l), 111 < 
S2(«2, t>2), S2(V2, U%), U2 < V2 


Here x Zl y, Ui = Vi , U2 = v 2 and the 
inversion path goes twice through each 
factor. We call this an open marked 
ring. 


R(x),S(x,y),S(y,x) 


qci = 
qc2 - 


= R(x),S(x,y),S(y,x),x <y 

- R(x'),S(x',y'),S&,x!_),x' >y' 


Here x □ y and the inversion is be- 
tween x,y and their copy y',x'. We 
call this a marked ring. 




qa 


= R(x),S(x,y,y),x^y, 




R(x), Si(x,y), 
Si(u 1 ,v 1 ),S 2 (u 1 ,v 1 ) 

S2(U2,V2), S2(V2,U2) 


qc2 


T(u, v), S(u, «,»],ii/u, 
U(y'),S{x',y',x'),x' ^ y 
= R(x),S(x,y,y),xjty 

T(w,v), S(w,v, w),w V, 
U(y'),S(d.,y^,x'),x' ^y' 


Here the inversion path goes twice 
through the subgoal S(u, v, w) using 
different pairs of variables. 



Figure 2: Queries with inversions: all are #P-hard 



Proposition B.3. Let $P_3$ be the 3-chain property in Example B.1. The complexity of computing $P[P_3]$ on 4-partite graphs
is #P-complete.

Let $T$ be the triangle property in Example B.2. The complexity of computing $P[T]$ on triangled graphs is #P-complete.

Proof. By reduction from the problem of computing the probability of bipartite 2DNF formulas. Let $X = \{x_1, \ldots, x_m\}$
and $Y = \{y_1, \ldots, y_n\}$ be two disjoint sets of Boolean variables, and consider a bipartite 2DNF formula:

$$\Phi = \bigvee_{k=1,t} x_{i_k} \wedge y_{j_k} \qquad (12)$$

Construct the following 4-partite graph: $V_0 = \{u\}$, $V_1 = X$, $V_2 = Y$, $V_3 = \{v\}$, where $u, v$ are two new nodes. All edges from
$u$ to $x_i$ are present and their probability is $P[x_i]$; for each clause $x_{i_k} \wedge y_{j_k}$ in (12) there is an edge $(x_{i_k}, y_{j_k})$ with probability
1, and all edges $(y_j, v)$ are present and have probability $P[y_j]$. Clearly the probability that this graph has a path of length 3
is precisely $P[\Phi]$. This proves the hardness of $P_3$. The hardness of $T$ is obtained similarly, by merging $u$ and $v$ into a single
node. □
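The graph construction in the proof of Proposition B.3 is easy to test on a small formula. The sketch below is a brute-force check under illustrative assumptions: variable names, clause list and probabilities are made up, and both probabilities are computed by enumerating all worlds rather than by any efficient method.

```python
from fractions import Fraction
from itertools import product

def probability(prob, event):
    """Probability that `event` holds on a random subset of independent items."""
    items = list(prob)
    total = Fraction(0)
    for bits in product([False, True], repeat=len(items)):
        weight = Fraction(1)
        for item, on in zip(items, bits):
            weight *= prob[item] if on else 1 - prob[item]
        if weight and event({item for item, on in zip(items, bits) if on}):
            total += weight
    return total

def build_graph(clauses, px, py):
    """Edge -> probability for the 4-partite graph with parts {u}, X, Y, {v}."""
    edges = {("u", x): p for x, p in px.items()}
    edges.update({(x, y): Fraction(1) for x, y in clauses})
    edges.update({(y, "v"): p for y, p in py.items()})
    return edges

def has_3_path(present):
    """True if edges (a,b), (b,c), (c,d) are all present for some a, b, c, d."""
    return any(b == b2 and c == c2
               for (a, b) in present for (b2, c) in present for (c2, d) in present)

# Hypothetical bipartite 2DNF: (x1 & y1) | (x1 & y2) | (x2 & y2).
clauses = [("x1", "y1"), ("x1", "y2"), ("x2", "y2")]
px = {"x1": Fraction(1, 2), "x2": Fraction(1, 3)}
py = {"y1": Fraction(1, 4), "y2": Fraction(2, 3)}

def satisfied(true_vars):
    return any(x in true_vars and y in true_vars for x, y in clauses)

p_formula = probability({**px, **py}, satisfied)
p_path = probability(build_graph(clauses, px, py), has_3_path)
print(p_formula == p_path)
```

Because every edge goes from one part to the next, any path of length 3 must have the shape $u \to x_i \to y_j \to v$, which is why the path event coincides with the satisfaction of some clause.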

Theorem B.4. Let $Q$ be a conjunctive formula, which is minimal, and let $P$ be a subformula. If there exists a class of structures $S$ s.t. (1) $P$ is decisive on $S$ and (2) $P$ is #P-complete on $S$, then $Q$ is #P-complete on the class of all structures.

Proof. We reduce the problem of evaluating $P$ on some structure in $S$ to the problem of evaluating $Q$ on an arbitrary structure. Let $A \in S$, and let $c : A \to Var(P)$ be a choice function. First define $H = \{h : P \to A \mid c \circ h = id_P\}$ to be the set of homomorphisms from $P$ to $A$ that are consistent with the choice function. Note that the size of $H$ is polynomial in the size of $A$, since $P$ is fixed. We construct a new structure $B$ as follows. Its nodes are obtained as follows. First define the set $N = \{(x,h) \mid x \in Var(Q), h \in H\}$; next define the equivalence relation $(x,h) \equiv (x',h')$ if $(x,h) = (x',h')$, or if $x = x' \in Var(P)$ and $h(x) = h'(x)$ (i.e. collapse multiple copies of the same variable from $P$ if they are mapped to the same node in $A$). The nodes in $B$ are the equivalence classes $[(x,h)]$, i.e. $B = N/\equiv$. The relations in $B$ are of the form $R([(x_1,h)], \ldots, [(x_k,h)])$, where $R(x_1, \ldots, x_k)$ appears in $Q$ and $h \in H$. One can think of $B$ as consisting of multiple copies of $Q$, one for each possible way of mapping $P$ into $A$, but such that all copies of the same $P$-variable that are mapped to the same node $u \in A$ are merged into a single node. The latter are precisely the nodes of the form $[(x,h)]$ for $x \in Var(P)$, and we call them the special nodes in $B$. Thus, the special nodes in $B$ form a substructure that is isomorphic to some substructure $A_0$ of $A$, which is large enough to contain the image of every homomorphism from $P$ to $A$. The probabilities are as follows. If $x_1, \ldots, x_k \in Var(P)$ then

$$P_B(R([(x_1,h)], \ldots, [(x_k,h)])) = P_A(R(h(x_1), \ldots, h(x_k)))$$

; otherwise $P_B(R([(x_1,h)], \ldots, [(x_k,h)])) = 1$. Note that there is a 1-to-1 correspondence between the worlds $W_A$ of $A$ and the worlds $W_B$ of $B$, and $P[W_A] = P[W_B]$.

Claim 1. Let $W_A$ be a world of $A$ s.t. $W_A \models P$. Then, denoting $W_B$ the corresponding world of $B$, we have $W_B \models Q$. Indeed, let $h : P \to A$ be a homomorphism whose image uses only tuples in $W_A$. We can assume w.l.o.g. that it is consistent with the choice function, i.e. $c \circ h = id_P$ (otherwise simply compose it with the automorphism $i$), hence $h \in H$. Extend it to a homomorphism $\bar{h} : Q \to B$ by defining $\bar{h}(x) = [(x,h)]$: it clearly only uses tuples in $W_B$.

Claim 2. Let $W_B$ be a world of $B$ s.t. $W_B \models Q$. Then, denoting $W_A$ the corresponding world of $A$, we have $W_A \models P$. Let $h : Q \to B$ be a homomorphism. If $h$ maps $Var(P)$ only to the special nodes in $B$, then we are done; but this may not necessarily be the case. We will prove instead that there exists some automorphism $g : Q \to Q$ s.t. $h \circ g$ maps $Var(P)$ to the special nodes in $B$.

Define the function $f : B \to Var(Q)$ to be $f([(x,h)]) = x$; one can check that it is a homomorphism from $B$ to $Q$, and that all special nodes and only these are mapped to $Var(P)$. Consider the composition $f \circ h : Q \to Q$, which is an isomorphism (since $Q$ is minimal); in particular $h^{-1}$ is functional, i.e. $|h^{-1}(u)| \le 1$. Define $g = (f \circ h)^{-1}$ to be its inverse. Then $h \circ g$ maps $Var(P)$ to the special nodes in $B$. Indeed, for any variable $x \in Var(P)$, $f^{-1}(x)$ consists only of special nodes, hence $h(g(x)) = h(h^{-1}(f^{-1}(x))) = Dom(h^{-1}) \cap f^{-1}(x)$ is a special node. □

Theorem B.5. Let $P = R_1(v_1), R_2(v_2), R_3(v_3)$ be a conjunctive property, which is minimal, and for which there exist two variables $x, y$ s.t. $x \in v_1$, $x \in v_2$, $x \notin v_3$ and $y \notin v_1$, $y \in v_2$, $y \in v_3$. Then there exists a class of structures $S$ s.t. (a) $P$ is decisive w.r.t. $S$ and (b) $P$ is #P-complete on structures in $S$. Note that $R_1, R_2, R_3$ may be any relation names, possibly the same relation name.

Proof. By reduction from partitioned 2DNF. Consider Eq. (12), and recall that the variables are $X = \{x_1, \ldots, x_m\}$, $Y = \{y_1, \ldots, y_n\}$. Let $U = \{u_1, u_2, \ldots, u_k\}$ be all the variables occurring in $P$ in addition to $x$ and $y$, and let $C$ be the set of constants. Define the structure $A$ s.t. $A = X \cup Y \cup U \cup C$, and the relations are defined as follows:

$$R_1 = \{R_1(v_1[x_i/x]) \mid i = 1,m\}$$

$$R_2 = \{R_2(v_2[x_{i_k}/x, y_{j_k}/y]) \mid k = 1,t\}$$

$$R_3 = \{R_3(v_3[y_j/y]) \mid j = 1,n\}$$

Thus, the tuples in the first set correspond to the Boolean variables $x_i$, those in the second set correspond to the clauses $x_{i_k} \wedge y_{j_k}$, and those in the third set correspond to the Boolean variables $y_j$. Note that the three sets defined on the right are disjoint: if two or more of the relation names $R_1, R_2, R_3$ are the same, then their interpretation in $A$ consists of the union of the corresponding right hand definitions above. The tuple probabilities are as follows: those in $R_1$ are precisely $P(x_i)$, those in $R_2$ are 1, and those in $R_3$ are precisely $P(y_j)$.

We first show that $P$ is decisive on $A$. Define the choice function $c : A \to Var(P)$ to be $c(x_i) = x$ for $i = 1,m$, $c(y_j) = y$ for $j = 1,n$ and $c(u_p) = u_p$ for $p = 1,k$. We need to prove that every homomorphism $h : P \to A$ is, up to isomorphism, consistent with the choice function. For that we note that the choice function itself is a homomorphism $c : A \to P$, hence $c \circ h : P \to P$ is an automorphism (since $P$ is minimal), and we denote $i = (c \circ h)^{-1}$. We show now that $h' = h \circ i$ is consistent with $c$. Indeed: $c \circ h' = c \circ h \circ (c \circ h)^{-1} = id_P$.

Next we prove that the probability of $P$ being true on $A$ is the same as the probability that $\Phi$ is true. There is an obvious one-to-one correspondence between worlds $W_A$ of $A$ and truth assignments for $\Phi$: the tuple in $R_1$ corresponding to $x_i$ occurs in $W_A$ iff $x_i = true$, and similarly for $R_3$ and the $y_j$'s. Clearly if the truth assignment makes $\Phi$ true, then $P$ is true on $W_A$: simply pick two variables $x_i$ and $y_j$ that are both true under the truth assignment and that form a clause of $\Phi$, and note that $P$ can be mapped to the three tuples corresponding to $x_i$, to the clause $x_i \wedge y_j$ and to $y_j$ respectively. Conversely, suppose $P$ is true on $W_A$, i.e. there exists a homomorphism $h : P \to A$ whose image is contained in $W_A$. Since $P$ is decisive on $A$ there exists another homomorphism $h' : P \to A$ that is consistent with $c$, i.e. it maps $x$ to some $x_i$ and $y$ to some $y_j$. Then $Im(h')$ consists of three tuples $R_1(v_1[x_i/x])$, $R_2(v_2[x_{i_k}/x, y_{j_k}/y])$ and $R_3(v_3[y_j/y])$, and, moreover, $x_{i_k} \wedge y_{j_k}$ is a clause in $\Phi$ which is true under the truth assignment corresponding to $W_A$.

□ 

Corollary B.6. Let Q be a non-hierarchical conjunctive query. Then Q is #P-hard. 

Proof. Consider the minimal conjunctive query defined by $Q$. Since $Q$ is non-hierarchical, there must be two variables $x$ and $y$ such that $sg(x) \cap sg(y) \neq \emptyset$, $sg(x) - sg(y) \neq \emptyset$ and $sg(y) - sg(x) \neq \emptyset$. Thus, the minimal query must contain a subformula $P = R_1(v_1), R_2(v_2), R_3(v_3)$ s.t. $x \in v_1$, $x \in v_2$, $x \notin v_3$ and $y \notin v_1$, $y \in v_2$, $y \in v_3$.

It follows from the previous two results that $Q$ is #P-hard. □
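The condition used in this proof is easy to test mechanically. The sketch below is one possible way to do it, under the assumption that a query is given as a list of `(relation_name, variables)` pairs with no constants; this representation and the function names are hypothetical. It computes $sg(x)$ for every variable and looks for a pair $x, y$ whose subgoal sets overlap without either containing the other.

```python
from itertools import combinations


def subgoal_sets(query):
    """Map each variable to the set of indices of the subgoals containing it."""
    sg = {}
    for idx, (_relation, variables) in enumerate(query):
        for var in variables:
            sg.setdefault(var, set()).add(idx)
    return sg


def non_hierarchical_witness(query):
    """Return a pair (x, y) witnessing non-hierarchy, or None if hierarchical."""
    sg = subgoal_sets(query)
    for x, y in combinations(sorted(sg), 2):
        if sg[x] & sg[y] and sg[x] - sg[y] and sg[y] - sg[x]:
            return (x, y)
    return None


chain = [("R", ("x",)), ("S", ("x", "y")), ("T", ("y",))]
print(non_hierarchical_witness(chain))
```

For the query `R(x), S(x,y), T(y)` the witness is the pair `x`, `y`: both occur in `S`, only `x` occurs in `R`, and only `y` occurs in `T`.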

C. PROOF OF THEOREM 1.5 

We will prove here that for every $k > 0$, $H_k$ is #P-hard. Recall that

$$H_k = R(x), S_0(x,y),\ S_0(u_1,v_1), S_1(u_1,v_1),\ S_1(u_2,v_2), \ldots,\ S_{k-1}(u_k,v_k), S_k(u_k,v_k),\ S_k(x',y'), T(y')$$

Define queries $\phi_0, \ldots, \phi_{k+1}$, where

$$\phi_0 = R(x), S_0(x,y)$$
$$\phi_i = S_{i-1}(u,v), S_i(u,v) \quad \text{for } 1 \le i \le k$$
$$\phi_{k+1} = S_k(x',y'), T(y')$$

Thus, $H_k = \bigwedge_{i \in [k]} \phi_i$. For any proper subset $S$ of $[k]$, the query $\bigwedge_{i \in S} \phi_i$ is in PTIME (this follows from a result we prove later that every inversion-free query is in PTIME). Using the principle of inclusion-exclusion, to show the hardness of $H_k$ it is enough to show that the query $\bigvee_{i \in [k]} \phi_i$ is hard, or equivalently, that its negation $q = \bigwedge_{i \in [k]} NOT(\phi_i)$ is hard.

We give a reduction from the problem of computing the probability of bipartite 2DNF formulas. Let $X = \{x_1, \ldots, x_m\}$ and $Y = \{y_1, \ldots, y_n\}$ be two disjoint sets of Boolean variables, and consider a bipartite 2DNF formula:

$$\Phi = \bigvee_{h=1,t} x_{i_h} \wedge y_{j_h} \qquad (13)$$

We construct an instance for relations $R, S_0, \ldots, S_k, T$. For each variable $x_i \in X$, create a tuple $R(x_i)$ and assign it a probability 1/2. For each $y_j \in Y$, create a tuple $T(y_j)$ and assign it a probability 1/2. For each clause $(x_{i_h}, y_{j_h})$, and for each $l \in [k]$, create a tuple $S_l(x_{i_h}, y_{j_h})$ and assign it a probability $p_1$ for $l = 0, k$ and a probability of $p_2$ for $1 \le l \le k-1$.

Let $T_{ij}$ be the number of assignments of $\Phi$ such that $i$ clauses have both variables true and $j$ clauses have no variables true. Thus, $(t - i - j)$ clauses have exactly 1 variable true, where $t$ is the number of clauses.

There is a canonical mapping between the truth assignments of $X, Y$ and the worlds of relations $R, T$, where $x \in X$ is true iff $R(x)$ is present and $y \in Y$ is true iff $T(y)$ is present.

Consider some fixed assignment where $i$ clauses have both variables true and $j$ clauses have no variables true. Fix relations $R, T$ accordingly and consider all possible worlds of $S_0, \ldots, S_k$ such that $q$ is true on the worlds. For each clause $(x_{i_h}, y_{j_h})$, consider all tuples of the form $S_l(x_{i_h}, y_{j_h})$:

1. If $x_{i_h}$ and $y_{j_h}$ are both true, the tuples $S_0(x_{i_h}, y_{j_h})$ and $S_k(x_{i_h}, y_{j_h})$ must both be out, and other edges do not matter. Its probability is $(1-p_1)^2$.

2. If one of them is true, one of the tuples $S_0(x_{i_h}, y_{j_h})$, $S_k(x_{i_h}, y_{j_h})$ must be out (depending on which variable is true), and other edges do not matter. Its probability is $(1-p_1)$.

3. If $x_{i_h}$ and $y_{j_h}$ are both false, the only requirement is that not all $S_l(x_{i_h}, y_{j_h})$ are in. Its probability is $(1 - p_1 p_2^{k-2})$.

Thus, the total probability of all worlds corresponding to this fixed assignment is

$$(1/2)^{|X|+|Y|} \, [(1-p_1)^2]^i \, [(1 - p_1 p_2^{k-2})]^j \, [(1-p_1)]^{t-i-j}$$

This can be written as $K A^i B^j$, where $K = (1/2)^{|X|+|Y|}(1-p_1)^t$, $A = (1-p_1)$ and $B = (1 - p_1 p_2^{k-2})/(1-p_1)$. Thus

$$\Pr[q] = \sum_{i,j \,:\, i+j \le t} T_{i,j} \, K A^i B^j$$

This is a linear equation in the variables $T_{ij}$. We put different values of $p_1, p_2$ to get different values of $A, B$ and get a system of linear equations. The coefficient matrix of this set of equations is the Vandermonde matrix, which is known to be invertible. By inverting the matrix, we solve for each $T_{ij}$. Finally, we can compute the number of satisfying assignments of $\Phi$ using $\sum_{i,j \,:\, i+j \le t,\ i \neq 0} T_{i,j}$. This gives a polynomial time reduction from counting the number of satisfying assignments of a bipartite DNF formula to the problem of computing $H_k$. Hence, $H_k$ is #P-hard.
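The interpolation step can be illustrated in isolation. The Python sketch below assumes a hypothetical `oracle(a, b)` that returns $\sum_{i,j} T_{i,j}\, a^i b^j$ and that can be queried on a full grid of $(t+1) \times (t+1)$ values; this grid is a simplifying assumption, since in the proof $A$ and $B$ are determined by $p_1, p_2$. On such a grid the coefficient matrix is a Kronecker product of two Vandermonde matrices, so it is invertible whenever the sample values are distinct, and the coefficients can be recovered by exact Gaussian elimination.

```python
from fractions import Fraction
from itertools import product


def solve_exact(matrix, rhs):
    """Solve matrix * x = rhs exactly by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [[Fraction(v) for v in row] + [Fraction(b)]
           for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Find a row with a non-zero entry in this column and swap it up.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_val = aug[col][col]
        aug[col] = [v / pivot_val for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


def recover_counts(oracle, t):
    """Recover all T[(i, j)] with 0 <= i, j <= t from oracle evaluations."""
    nodes = [Fraction(1, s + 2) for s in range(t + 1)]   # distinct sample values
    index = list(product(range(t + 1), repeat=2))        # unknowns (i, j)
    points = list(product(nodes, repeat=2))              # sample points (a, b)
    matrix = [[a ** i * b ** j for (i, j) in index] for (a, b) in points]
    rhs = [oracle(a, b) for (a, b) in points]
    return dict(zip(index, solve_exact(matrix, rhs)))


# Toy example: hidden counts that the interpolation should reproduce.
hidden = {(0, 0): 1, (1, 0): 3, (0, 1): 2, (2, 0): 5}


def toy_oracle(a, b):
    return sum(c * a ** i * b ** j for (i, j), c in hidden.items())


recovered = recover_counts(toy_oracle, 2)
print(all(recovered[key] == hidden.get(key, 0) for key in recovered))
```

Coefficients with $i + j > t$ are simply recovered as zero here; exact rational arithmetic is used so that no rounding error enters the counts.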

D. PROOF OF THEOREM 2.7 

Consider some probability space. Let $U = (U_1, \ldots, U_k)$ be a vector consisting of $k$ sets. For each $i \in [k]$ and each $x \in U_i$, let $E(i,x)$ be an event in the probability space. Define $E(i) = \bigvee_{x \in U_i} E(i,x)$. Let $Q$ be a DNF formula over the events $E(1), \ldots, E(k)$, i.e., let $\psi$ be a set of subsets of $[k]$ and let

$$Q = \bigvee_{s \in \psi} \bigwedge_{i \in s} E(i) \qquad (14)$$

We will derive an expression for $\Pr[Q]$ in terms of the probabilities of the events $E(i,x)$. We need some notation. A signature is simply a subset of $[k]$. Given a vector of sets $S = (S_1, \ldots, S_k)$, the signature of $S$, denoted $sig(S)$, is the set $\{i \mid S_i \neq \emptyset\}$. $E(S)$ is defined as the event $\bigwedge_{i \in [k]} \bigwedge_{x \in S_i} E(i,x)$. The size of $U$ is defined as $|U| = |U_1| + \cdots + |U_k|$. Also, given vectors $S$ and $T$, we say that $S \subseteq T$ iff for all $i \in [k]$, $S_i \subseteq T_i$.

Define the upward closure of $\psi$ as $up(\psi) = \{sg \mid sg \subseteq [k], \exists sg_0 \in \psi \text{ s.t. } sg_0 \subseteq sg\}$. Define the minimal elements of $\psi$ as $Factors(\psi) = \{sg \mid sg \in \psi, \forall sg_0 \in \psi.\ sg_0 \subseteq sg \Rightarrow sg_0 = sg\}$. For a set of signatures $G$, let $sig(G) = \bigcup_{sg \in G} sg$. Given a signature $sg$, define

$$N(sg) = (-1)^{|sg|} \sum_{G \,\mid\, G \subseteq Factors(\psi),\ sig(G) = sg} (-1)^{|G|}$$
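The upward closure and the minimal elements defined above can be computed directly from their definitions for small $k$. The following Python sketch is a brute-force illustration, assuming signatures are represented as `frozenset`s of integers from $1$ to $k$; the function names are hypothetical.

```python
from itertools import combinations


def powerset(items):
    """All subsets of items, as frozensets."""
    items = sorted(items)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]


def upward_closure(psi, k):
    """All sg within {1..k} that contain some element of psi."""
    return {sg for sg in powerset(range(1, k + 1))
            if any(sg0 <= sg for sg0 in psi)}


def minimal_elements(psi):
    """Elements of psi that have no proper subset in psi."""
    return {sg for sg in psi if not any(sg0 < sg for sg0 in psi)}


psi = {frozenset({1}), frozenset({1, 2}), frozenset({2, 3})}
print(sorted(sorted(sg) for sg in minimal_elements(psi)))
print(len(upward_closure(psi, 3)))
```

In this example `{1, 2}` is not minimal because it contains `{1}`, so the minimal elements are `{1}` and `{2, 3}`, and the upward closure consists of every subset of `{1, 2, 3}` containing one of them.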

Our main result is as follows:
Theorem D.1. With $U$, $\psi$ and $Q$ as defined above,

$$\Pr[Q] = \sum_{S \subseteq U} N(sig(S)) \, (-1)^{|S|} \, \Pr[E(S)]$$

We will need the following result later, which gives an alternate formula for $N(sg)$.
Lemma D.2. $N(sg) = \sum_{\{sg_0 \,\mid\, sg_0 \subseteq sg,\ sg_0 \in up(\psi)\}} (-1)^{|sg_0|}$.

In the rest of the section, we prove this theorem.

Let $*$ be an element such that $* \notin U_i$ for all $i$ and define $U_i^* = U_i \cup \{*\}$. Given an element $x \in U_1^* \times \cdots \times U_k^*$, the signature of $x$, denoted $sig(x)$, is the subset of $[k]$ given by $\{i \mid \pi_i(x) \neq *\}$. Given a vector of sets $S = (S_1, \ldots, S_k)$ where $S_i \subseteq U_i$, define

$$\Pi_\psi(S) = \{x \mid x \in (S_1 \cup \{*\}) \times \cdots \times (S_k \cup \{*\}),\ sig(x) \in \psi\}$$

Given a vector $x \in \Pi_\psi(U)$, define $E(x) = \bigwedge_{i \in sig(x)} E(i, \pi_i(x))$. Then, from Eq. (14), it follows that

$$Q = \bigvee_{x \in \Pi_\psi(U)} E(x)$$

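Using inclusion-exclusion, we obtain

$$\Pr[Q] = \sum_{T \subseteq \Pi_\psi(U)} (-1)^{|T|} \Pr\Big[\bigwedge_{x \in T} E(x)\Big] \qquad (15)$$

For a set $T \subseteq \Pi_\psi(U)$, define $\pi_i(T) = \{\pi_i(x) \mid x \in T, \pi_i(x) \neq *\}$. Also, define $E(i,S) = \bigwedge_{s \in S} E(i,s)$. Then, $\bigwedge_{x \in T} E(x) = E(1, \pi_1(T)) \wedge \cdots \wedge E(k, \pi_k(T))$.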

In Eq. (15) we group the $T$ based on their projections to obtain

$$\Pr[Q] = \sum_{S_1, \ldots, S_k} \Pr\Big[\bigwedge_{i \in [k]} E(i, S_i)\Big] \cdot \Big( \sum_{T \subseteq \Pi_\psi(U),\ \pi_i(T) = S_i} (-1)^{|T|} \Big) \qquad (16)$$

Let $N(S_1, \ldots, S_k)$ denote the sum $\sum_{T \subseteq \Pi_\psi(U),\ \pi_i(T) = S_i} (-1)^{|T|}$. Thus,

$$\Pr[Q] = \sum_{S_1, \ldots, S_k} N(S_1, \ldots, S_k) \, \Pr\Big[\bigwedge_{i \in [k]} E(i, S_i)\Big]$$

The main result of this section is an expression for the quantity $N(S_1, \ldots, S_k)$. Given a vector $S = (S_1, \ldots, S_k)$, define the signature of $S$, denoted $sig(S)$, as the set $\{i \mid S_i \neq \emptyset\}$.

In an ordered set $(X, \le)$, an ideal is a set of the form $\{x \mid x \le a\}$, for a fixed element $a \in X$, which we denote by $[a]$.

Lemma D.3. If $[A]$ is an ideal in $P(U)$, then $\sum_{\{T \mid T \in [A]\}} (-1)^{|T|} = 0$ if $A$ is nonempty, and is 1 if $A$ is empty. Note that $T \in [A]$ means $T \subseteq A$.
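Lemma D.3 is the standard fact that the alternating sum over all subsets of a finite set vanishes unless the set is empty. A brute-force check in Python, offered only as an illustration (the function name is hypothetical):

```python
from itertools import combinations


def alternating_sum(a):
    """Sum of (-1)^|T| over all subsets T of the finite set a."""
    items = list(a)
    return sum((-1) ** r
               for r in range(len(items) + 1)
               for _ in combinations(items, r))


print(alternating_sum(set()), alternating_sum({"a"}), alternating_sum({1, 2, 3}))
```

The sum equals $(1-1)^{|A|}$ by the binomial theorem, which is 1 for the empty set and 0 otherwise.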

For $S = (S_1, \ldots, S_k)$, denote

$$ND(S) = \sum_{T \subseteq \Pi_\psi(S)} (-1)^{|T|}$$

Lemma D.4. $N(S) = \sum_{R \subseteq S} (-1)^{|S - R|} ND(R)$. Here $R = (R_1, \ldots, R_k)$ and $R \subseteq S$ means $R_i \subseteq S_i$ for all $i$.
Proof. Direct inclusion-exclusion applied to $N(S)$. □
Define $up(\psi) = \{sg \mid sg \subseteq [k], \exists sg' \in \psi \text{ s.t. } sg' \subseteq sg\}$.

Lemma D.5. 1. If $sig(R) \in up(\psi)$, then $ND(R) = 0$.
2. If $sig(R) \notin up(\psi)$, then $ND(R) = 1$.

Proof. Follows from the fact that $sig(R) \in up(\psi)$ iff $\Pi_\psi(R) \neq \emptyset$ and from the fact that $ND(R)$ sums $(-1)^{|T|}$, where $T$ ranges over the ideal defined by $\Pi_\psi(R)$. □

Hence, $N(S) = (-1)^{|S|} \sum_{\{R \subseteq S \,:\, sig(R) \notin up(\psi)\}} (-1)^{|R|}$.

Let $sg$ be a signature, i.e. $sg \subseteq [k]$. Denote the quantity $M(S, sg) = \sum_{R \subseteq S \,:\, sig(R) = sg} (-1)^{|R|}$. Thus, we have:

$$N(S) = (-1)^{|S|} \sum_{sg \notin up(\psi)} M(S, sg)$$

For a signature $sg' \subseteq [k]$, denote:

$$MD(S, sg') = \sum_{R \subseteq S \,:\, sig(R) \subseteq sg'} (-1)^{|R|}$$

Lemma D.6. $M(S, sg) = \sum_{sg' \subseteq sg} (-1)^{|sg - sg'|} \cdot MD(S, sg')$
Proof. Again the inclusion-exclusion formula applied to the set $sg$. □

Lemma D.7. 1. If $sg' \cap sig(S) \neq \emptyset$, then $MD(S, sg') = 0$.
2. If $sg' \cap sig(S) = \emptyset$, then $MD(S, sg') = 1$.

Proof. Follows from the fact that the set $\{R \mid R \subseteq S, sig(R) \subseteq sg'\}$ is an ideal, and its generator is nonempty iff $sg' \cap sig(S) \neq \emptyset$. □
Next, we manipulate the expression $M(S, sg)$ as follows. We have $M(S, sg) = (-1)^{|sg|} M'(S, sg)$, where:

$$M'(S, sg) = \sum_{sg' \subseteq sg} (-1)^{|sg'|} \, MD(S, sg')$$
$$= \sum_{sg' \subseteq sg,\ sg' \cap sig(S) = \emptyset} (-1)^{|sg'|}$$
$$= \sum_{sg' \subseteq (sg - sig(S))} (-1)^{|sg'|}$$

This is a sum over the ideal generated by $sg - sig(S)$. This ideal contains only the empty set when $sg \subseteq sig(S)$, hence:

Lemma D.8. 1. If $sg \subseteq sig(S)$, then $M(S, sg) = (-1)^{|sg|}$.
2. If $sg \not\subseteq sig(S)$, then $M(S, sg) = 0$.

Hence,

$$N(S) = (-1)^{|S|} \sum_{sg \notin up(\psi)} M(S, sg)$$
$$= (-1)^{|S|} \sum_{sg \notin up(\psi),\ sg \subseteq sig(S)} (-1)^{|sg|}$$
$$= (-1)^{|S|} \sum_{sg \subseteq sig(S)} (-1)^{|sg|} \;-\; (-1)^{|S|} \sum_{sg \in up(\psi),\ sg \subseteq sig(S)} (-1)^{|sg|}$$
$$= -(-1)^{|S|} \sum_{sg \in up(\psi),\ sg \subseteq sig(S)} (-1)^{|sg|}$$

The last equality holds because we assume $sig(S) \neq \emptyset$, hence $sg \subseteq sig(S)$ is a non-empty ideal.
Theorem D.9. $N(S) = -(-1)^{|S|} \sum_{sg \in up(\psi),\ sg \subseteq sig(S)} (-1)^{|sg|}$.

Next, assume that $up(\psi)$ is generated by the set $\psi = \{\phi_1, \ldots, \phi_p\}$, where each factor $\phi_i$ is a subset of $[k]$. Then we apply inclusion-exclusion to the formula of Theorem D.9:

$$N(S) = -(-1)^{|S|} \sum_{G \subseteq [p]} (-1)^{|G|-1} \sum_{\bigcup_{i \in G} \phi_i \,\subseteq\, sg \,\subseteq\, sig(S)} (-1)^{|sg|}$$

In the inner sum $sg$ ranges over the interval $[\bigcup_{i \in G} \phi_i, sig(S)]$, hence the sum is $(-1)^{|sig(S)|}$ when $\bigcup_{i \in G} \phi_i = sig(S)$ and 0 otherwise. It follows:

Theorem D.10. $N(S) = (-1)^{|S|} (-1)^{|sig(S)|} \sum_{sig(G) = sig(S)} (-1)^{|G|}$.

E. PROOF OF THE DICHOTOMY THEOREM 
E.l Unifiers 

In this section, we define a set $\mathcal{H}(q)$, called the set of hierarchical unifiers of $q$, by starting from the factors of $q$ and unifying them in a certain way.

Definition E.1. (Hierarchical join predicate) Let $q_1$ and $q_2$ be two strict hierarchical queries with disjoint sets of variables and let $g_1 \in subgoals(q_1)$ and $g_2 \in subgoals(q_2)$ be any two sub-goals that are unifiable. Thus, $g_1$ and $g_2$ have the same arity, say $a$. Let $m_u : Vars(g_1) \to Vars(g_2)$ be the most general unifier of $g_1$ and $g_2$, which is a bijection. Let $x_1 \sqsubset \cdots \sqsubset x_a$ be all the variables in $g_1$ and $y_1 \sqsubset \cdots \sqsubset y_a$ be all the variables in $g_2$. Let $w$ be the largest integer such that $m_u(x_i) = y_i$ for $1 \le i \le w$. A hierarchical join predicate between $q_1$ and $q_2$ is the set $\{(x_i, m_u(x_i)) \mid 1 \le i \le w\}$.

Definition E.2. (Hierarchical Unifier) Let $q_1$ and $q_2$ be two strict hierarchical queries with disjoint sets of variables and let $jp$ be some hierarchical join predicate between them. A hierarchical unifier of $q_1$ and $q_2$ is a query obtained by considering

$$q_u \leftarrow q_1, q_2, \bigwedge_{(x_i, x_j) \in jp} (x_i = x_j)$$

and removing all $=$ predicates by substituting.
Lemma E.3. Let $q_u$ be a hierarchical unifier of two strict hierarchical queries $q_1$ and $q_2$. Then, $q_u$ is a strict hierarchical query.

Proof. TBD. □

The above result justifies the name "hierarchical unifier", because such unifiers are always hierarchical. Next we define a set $\mathcal{H}(q)$, called the set of hierarchical unifiers of $q$, along with a function $Factors$ from $\mathcal{H}(q)$ to subsets of $\mathcal{F}(q)$. They are constructed inductively as follows:

1. For each $q' \in \mathcal{F}(q)$, add $q'$ to $\mathcal{H}(q)$ and let $Factors(q') = \{q'\}$.

2. If $q_1, q_2$ are in $\mathcal{H}(q)$, and $q_u$ is their hierarchical unifier, add $q_u$ to $\mathcal{H}(q)$ if it is not logically equivalent to any existing query in $\mathcal{H}(q)$. Also, define $Factors(q_u)$ to be $Factors(q_1) \cup Factors(q_2)$.

Lemma E.4. The set $\mathcal{H}(q)$ is finite.

Proof. All queries in $\mathcal{H}(q)$ are hierarchical by Lemma E.3. There are only finitely many hierarchical queries up to equivalence on a given set of relations and a given set of constants. [[Expand this proof]]. □

E.2 The Polynomial Time Algorithm 

Let $\mathcal{H}^*(q)$ be the subset of $\mathcal{H}(q)$ containing queries which are either inversion-free or in $\mathcal{F}(q)$.

E.2.1 Query expansion
Let $\mathcal{H}^*(q) = \{qh_1, qh_2, \ldots, qh_k\}$. Define

$$\psi = \{S \mid S \subseteq [k],\ qc_i \subseteq \big(\bigcup_{j \in S} Factors(qh_j)\big) \text{ for some } qc_i \in C(q)\}$$

Thus, $\psi$ contains all combinations of hierarchical unifiers that make $q$ true. Let $Factors(\psi)$ be the minimal elements of $\psi$.
Lemma E.5. With $\psi$ as defined above,

$$q = \bigvee_{sg \in \psi} \bigwedge_{i \in sg} qh_i$$

Proof. The $\Leftarrow$ direction is obvious from the definition of $\psi$. For the $\Rightarrow$ direction, consider any mapping $\eta$ of $q$ into the database. Consider the factor corresponding to that mapping and the set of its connected components. This set is in $\psi$ and hence $\bigvee_{sg \in \psi} \bigwedge_{i \in sg} qh_i$ is true on the database. □

We then apply the generalized inclusion-exclusion formula from Sec. D to obtain:

$$\Pr[q] = \sum_{G \subseteq Factors(\psi)} (-1)^{|G| + |sig(G)|} \sum_{\{T \mid sig(T) = sig(G)\}} (-1)^{|T|} \Pr[qh(T)]$$

where $qh(T) = qh_1(\pi_1(T)), qh_2(\pi_2(T)), \ldots, qh_k(\pi_k(T))$.

Define $coeff(sg) = (-1)^{|sg|} \sum_{G \subseteq Factors(\psi),\ sig(G) = sg} (-1)^{|G|}$. The sum can alternatively be rewritten as:

$$\Pr[q] = \sum_{sg \subseteq [k]} F(sg)$$

where

$$F(sg) = coeff(sg) \sum_{\{T \mid sig(T) = sg\}} (-1)^{|T|} \Pr[qh(T)]$$

E.2.2 Adding Independence Predicates

Let $\bar{x}_1, \ldots, \bar{x}_k$ be the sets of variables of $qh_1, \ldots, qh_k$. Define new relational symbols $S_1, \ldots, S_k$ where the arity of $S_i$ equals $|\bar{x}_i|$. Given any join predicate $jp$ between $qh_i$ and $qh_j$, consider the following conjunctive query $q_{jp}$:

$$q_{jp} = S_i(\bar{x}_i), S_j(\bar{x}_j), \bigwedge_{(x,y) \in jp} (S_i.x = S_j.y)$$

Given any set $T = (T_1, \ldots, T_k)$, let $q_{jp}(T)$ be the predicate which is true if $q_{jp}$ is true when evaluated on $T$, i.e. by setting $T_i$ to be the instance of $S_i$.

An independence predicate is simply the negation of a join predicate. Let $Q_{ip}$ be the set of all independence predicates, i.e., $Q_{ip} = \{not(q_{jp}) \mid q_{jp} \in Q_{jp}\}$.

We divide the join predicates into two disjoint sets, trivial and non-trivial. A join predicate between factors $h_i$ and $h_j$ is called trivial if the join query is equivalent to either $h_i$ or $h_j$, and is called non-trivial otherwise. We write $ip(C^*)$ as $ip^n(C^*) \wedge ip^t(C^*)$, where $ip^n(C^*)$ is the conjunction of $not(jp)$ over all non-trivial join predicates $jp$, and $ip^t(C^*)$ is the conjunction over all trivial join predicates.
For a signature $sg$, let $ip^n(sg)$ denote the subset of $ip^n$ consisting of independence predicates between all $S_i$ and $S_j$ such that $i, j \in sg$. Let $\pi$ be a function that maps each signature $sg$ to a set of independence predicates $\pi(sg) \subseteq ip^n(sg)$.
Let $\pi$ be any predicate on $T$, i.e. a query over the relations $S_1, \ldots, S_k$. Define

$$sum(\pi) = \sum_{T \,:\, \pi(T)} N(sig(T)) \, (-1)^{|T|} \, \Pr[qh(T)]$$

Thus, the probability of $q$ is simply $sum(\emptyset)$, where $\emptyset$ is the predicate that is identically true. Define $ip^n$ to be the conjunction of all independence predicates between queries in $\mathcal{H}^*(q)$, i.e. $ip^n = \bigwedge_{jp} not(jp)$, where $jp$ ranges over all join predicates between all $q_i, q_j \in \mathcal{H}^*(q)$.

We will prove that when the query satisfies the PTIME conditions, then $sum(\emptyset) = sum(ip^n)$.

Definition E.6. (Eraser) Let $qh_i$ and $qh_j$ be any two strict hierarchical queries in $\mathcal{H}^*(q)$ and let $q_{ij}$ be their unifier corresponding to some join predicate $jp$. An eraser for the unifier $q_{ij}$ is a set of queries $E \subseteq \mathcal{H}^*(q)$ such that:

1. For all $q' \in E$, $q' \to q_{ij}$.

2. For all $sg \subseteq [k]$, $N(sg \cup \{i, j\}) = N(sg \cup \{i, j\} \cup \{k \mid qh_k \in E\})$.

Theorem E.7. Suppose for every $q_i, q_j, q_{ij}$ such that $q_i, q_j \in \mathcal{H}^*(q)$ and $q_{ij}$ is a hierarchical unifier of $q_i$ and $q_j$, either $q_{ij} \in \mathcal{H}^*(q)$ or it has an eraser. Then, $sum(\emptyset) = sum(ip^n)$.

We will prove Theorem E.7 in the rest of this section.

Let $N$ be the size of the domain for the database. Let $\mathcal{S}$ denote the vocabulary $S_1, \ldots, S_k$. Let $Q_{N,\mathcal{S}}(k)$ be the set of conjunctive queries of arity $k$ over $\mathcal{S}$, up to equivalence on domains of size $N$. For each $q \in Q_{N,\mathcal{S}}(k)$, define the following:

$$q^* = \exists \bar{x}.\ q(\bar{x}) \wedge \Big( \bigwedge_{\{q' \mid q' \in Q_{N,\mathcal{S}}(k),\ q' \text{ contains } q\}} not(q'(\bar{x})) \Big)$$

Let $Q^*_{N,\mathcal{S}}(k) = \{q^* \mid q \in Q_{N,\mathcal{S}}(k)\}$ and let $Q^*_{N,\mathcal{S}} = \bigcup_{k \ge 0} Q^*_{N,\mathcal{S}}(k)$. Each query in $Q^*_{N,\mathcal{S}}$ is Boolean, hence $Q^*_{N,\mathcal{S}}$ contains only finitely many queries up to equivalence on domains of size $N$, which we denote $\{qs_1^*, qs_2^*, \ldots, qs_t^*\}$. For each $qs_i^*$, $qs_i$ denotes the conjunctive query which is the positive part of $qs_i^*$.

We call each such query a cell. A cell signature is any subset of $Q^*_{N,\mathcal{S}}$. Given a cell signature $csig$, it defines the following query:

$$\Big( \bigwedge_{q \in csig} q \Big) \wedge \Big( \bigwedge_{q \notin csig} not(q) \Big)$$

Given a set $T$, we say $T \models csig$ if $T$ satisfies the query defined by $csig$. The cell signatures partition the set of all $T$. Thus, we have

$$\Pr[q] = \sum_{T} N(sig(T)) \, (-1)^{|T|} \, \Pr[qh(T)]$$
$$= \sum_{csig} \ \sum_{\{T \mid T \models csig\}} N(sig(T)) \, (-1)^{|T|} \, \Pr[qh(T)]$$

We say that a cell signature $csig$ contains a join predicate if there is a cell $qs^* \in csig$ and a join predicate query $q_{jp}$ such that $qs^* \subseteq q_{jp}$.

Lemma E.8. Let q be the union of all cell signatures that do not contain any join predicate. Then T \= q iff T satisfies all 
the independence predicates. 

Proof. □ 

To prove Theorem E.7 we only need to show that the total contribution of all cell signatures that contain at least one join predicate is 0. We will show this by grouping cell signatures into groups of three.

Let $F(csig)$ denote the quantity $\sum_{\{T \mid T \models csig\}} N(sig(T)) \, (-1)^{|T|} \, \Pr[qh(T)]$. Let $qh_i, qh_j$ be any two hierarchical queries with unifier $q_u$ corresponding to the join predicate $jp$, and let $q_{jp}$ be the join predicate query on the $\mathcal{S}$ vocabulary. Let $E = \{qh_{i_1}, \ldots, qh_{i_m}\}$ be its eraser. Thus, there is a mapping $h : qh_{i_1}, \ldots, qh_{i_m} \to q_u$. Let $q_{jp,E} = h(S_{i_1}), \ldots, h(S_{i_m}), q_{jp}$.

Let $qs_m$ be any query that contains $q_{jp}$ but not $q_{jp,E}$. Thus, there is a mapping $g : q_{jp} \to qs_m$. Let $f = h \circ g$ and let $qs'_m = f(S_{i_1}), \ldots, f(S_{i_m}), qs_m$.

Let $csig_0$ be any subset of $Q^*_{N,\mathcal{S}}$ that does not contain $qs_m$ and $qs'_m$.

Lemma E.9. With $csig_0$, $qs_m$ and $qs'_m$ as defined above,

$$F(csig_0 \cup \{qs_m\}) + F(csig_0 \cup \{qs'_m\}) + F(csig_0 \cup \{qs_m, qs'_m\}) = 0$$
Proof. Consider any $T_0$ that satisfies any of the three cells. Then, $T_0$ satisfies the query $qs_m$ (note that $qs'_m$ contains the query $qs_m$). Let $H_{i_l}$ be the set of tuples obtained for $S_{i_l}$ from $qs_m(T_0)$ using the mapping $f$.

Let $T'$ be obtained from $T_0$ by removing the tuples $H_{i_l}$ from $T_{i_l}$ for all $i_l$ in the eraser. Now we fix $T'$ and look at all the $T$ satisfying any of the three cells and which give rise to the same $T'$. Every such $T$ is obtained by adding some subset of $H_{i_l}$ to $T'_{i_l}$.

Claim: Every possible $T$ obtained from $T'$ by adding some subset of $H_{i_l}$ to $T'_{i_l}$ satisfies one of the three cells.
Further, for a fixed $T'$, each $T$ gives rise to the same query $qh(T)$. Thus, when we sum over all such $T$, we get a sum over an ideal, which is 0. Summing over all $T'$, we get that the total contribution of the three cells is 0. □

Lemma E.10. The set of all cell signatures that contain at least one join predicate can be partitioned into groups of three of the form in Lemma E.9.

Proof. Each triplet is defined by (i) a join predicate $q_{jp}$ with an eraser $E$, (ii) a pair of queries $qs_a$ and $qs_b$ where $qs_a$ contains $q_{jp}$ but not $q_{jp,E}$ and $qs_b$ is obtained from $qs_a$ by attaching $E$, and (iii) a subset of cells $csig_0$. The triplet is then given by: $csig_0 \cup \{qs_a\}$, $csig_0 \cup \{qs_b\}$ and $csig_0 \cup \{qs_a, qs_b\}$.

Now, given any $csig$ containing a join predicate, define $q_{jp}$, $qs_a$, $qs_b$ and $csig_0$ as follows. Order the set of all join predicates and the set of cells and pick a canonical eraser for each join predicate. Let $q_{jp}$ be the smallest join predicate in $csig$. Let $E$ be the canonical eraser for $q_{jp}$ and let $q_{jp,E}$ be the query as described above.

Let $qs_m$ be the smallest cell in $csig$ that contains $q_{jp}$. If $qs_m$ does not contain $q_{jp,E}$, let $qs_a = qs_m$ and define $qs_b$ appropriately. Note that $csig$ may not contain $qs_b$. If $qs_m$ contains $q_{jp,E}$, let $qs_b = qs_m$ and define $qs_a$ appropriately. Again, $csig$ may not contain $qs_a$. Let $csig_0$ be all the cells in $csig$ except $qs_a$ and $qs_b$. This defines the triplet for $qs_m$.

Claim: every cell signature containing a join predicate belongs to a unique triplet.

This follows from Lemma E.11. □

Lemma E.11. Let $q_i$ and $q_j$ be two queries with a join predicate $q_u$ that has an inversion. Suppose $E$ is an eraser for $q_u$, such that there is a mapping $h : E \to q_u$. Then, for any $q_l \in E$, $h(q_l), q_i$ is hierarchical.

Proof. Suppose on the contrary there is an inversion between R(x),S(x,y) £ qi and S(x' ,y'),T(y') in qi such that 
h(x) = x' , h(y) = y , where R(x) is some subgoal that contains x but not y, S(x,y) is some subgoal containing both x and 
y, S(x' , y') is a subgoal containing both x' and y' and T(y') is a subgoal containing y but not x' . 

There are two cases: the join predicate between qi and qj does not touch variable x' in qj. Then, no subgoal of qi in q u 
contains the variable x' . So, h maps R(x) to some subgoal in qj itself. Thus, qj is not hierarchical, which is a contradiction. 

Hence, the join predicate between qi and qj uses the variable x' . It also uses y' because i'Ci. Now, since qi and qi have 
inversion, there must be an eraser E' that has a mapping to h(qi), qi. This eraser only uses a portion of the partial unifier of 
qi , qj , hence there is a mapping E' — > q u . □ 

E.2.3 Change of Basis 

We have

$$\Pr[q] = \sum_{T \,:\, ip^n(T)} N(sig(T)) \, (-1)^{|T|} \, \Pr[qh(T)]$$

For each $i$, we expand $qh_i(T_i)$ into the relations it contains. We group all the $T$ that result in the same $qh(T)$.

Each $qh_i$ is a connected hierarchical query. Let $\sqsubset$ be the hierarchy relation on $Vars(qh_i)$. Define a hierarchy tree for $qh_i$ as follows. The nodes of the tree are certain subsets of $Vars(qh_i)$. For each subset of the set of subgoals of $qh_i$, there is a node in the hierarchy tree consisting of the intersection of the variables of those subgoals. A node $n$ is a child of $n'$ if $n \subset n'$ and there is no $n''$ such that $n \subset n'' \subset n'$.
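The construction of the hierarchy tree can be sketched as follows. The code takes the variable sets of the subgoals of one query, forms the intersection for every non-empty subset of subgoals, and links each node to its maximal proper subsets, following the child relation as stated above. Representing subgoals only by their variable sets, and the function name `hierarchy_tree`, are assumptions made for this illustration.

```python
from itertools import combinations


def hierarchy_tree(subgoal_vars):
    """Return (nodes, children) for the hierarchy tree of one query.

    subgoal_vars: iterable of variable sets, one per subgoal.
    nodes: set of frozensets (intersections over non-empty subgoal subsets).
    children: dict mapping each node to the set of its child nodes.
    """
    goals = [frozenset(g) for g in subgoal_vars]
    nodes = set()
    for r in range(1, len(goals) + 1):
        for subset in combinations(goals, r):
            nodes.add(frozenset.intersection(*subset))
    children = {n: set() for n in nodes}
    for n in nodes:
        for parent in nodes:
            # n is a child of parent if n is a proper subset with nothing between.
            if n < parent and not any(n < mid < parent for mid in nodes):
                children[parent].add(n)
    return nodes, children


# Example: qh = R(x), S(x,y), T(x,y,z)
nodes, children = hierarchy_tree([{"x"}, {"x", "y"}, {"x", "y", "z"}])
print(sorted(sorted(n) for n in nodes))
```

For `R(x), S(x,y), T(x,y,z)` the nodes are `{x}`, `{x,y}` and `{x,y,z}`, forming a single chain.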

For each node in the hierarchy tree of $qh_i$, we define a new relational symbol whose attributes are the variables in that node. Let $S^i = \{S^i_0, S^i_1, \ldots\}$ be the set of new relational symbols and let $\{X^i_0, X^i_1, \ldots\}$ be the corresponding sets of variables.

Consider any vector $U_i = (U_i^1, U_i^2, \ldots)$, where $U_i^j \subseteq A^{Arity(S^i_j)}$. We say that $T \models U$ if for all $i, j$, $U_i^j$ is the projection of $T_i$ on the variables $X^i_j$. Define $F_i(U_i) = \prod_{g \,\mid\, Vars(g) = X^i_j} \Pr[qh_i(U_i^j)]$ and let $F(U) = F_1(U_1) \times \cdots \times F_k(U_k)$. Then, if $T \models U$ and $T$ satisfies all the independence predicates, we have $\Pr[qh(T)] = F(U)$.

We rewrite $\Pr[q]$ as

$$\Pr[q] = \sum_{U} \ \sum_{\{T \mid T \models U,\ ip^n(T)\}} N(sig(T)) \, (-1)^{|T|} \, \Pr[qh(T)]$$

The signature of $T$ can be determined from the signature of $U$ in a straightforward way, and we write $N(sig(T))$ as $N(sig'(U))$. Also, we write $\Pr[qh(T)]$ as $F(U)$. We have

$$\Pr[q] = \sum_{U} N(sig'(U)) \, F(U) \sum_{\{T \mid T \models U,\ ip^n(T)\}} (-1)^{|T|}$$

Next, we note that $ip^n(T)$ is independent of $T$ for a given $U$, and we move the independence predicates to $U$ as follows. For each independence predicate between sub-goal $g_1$ of $qh_{i_1}$ and sub-goal $g_2$ of $qh_{i_2}$, we add the independence predicate $not(S^{i_1}_{j_1}(\bar{x}), S^{i_2}_{j_2}(\bar{x}))$, where $S^{i_1}_{j_1}$ is the relation corresponding to $Vars(g_1)$ and $S^{i_2}_{j_2}$ is the relation corresponding to $Vars(g_2)$. Let $ip^n(U)$ be the conjunction of all such predicates. Then,

$$\Pr[q] = \sum_{\{U \mid ip^n(U)\}} N(sig'(U)) \, F(U) \sum_{\{T \mid T \models U\}} (-1)^{|T|}$$

Not all possible $U$ have a possible $T$. For instance, if relations $S^i_{j_1}$ and $S^i_{j_2}$ share a set of variables $X$, then $U$ must have $\pi_X(S^i_{j_1}) = \pi_X(S^i_{j_2})$. We use the hierarchy tree to determine when a $U$ has a possible $T$. For each $S^i_{j_1}$ and $S^i_{j_2}$ such that $S^i_{j_1}$ is a child of $S^i_{j_2}$, define the predicate $S^i_{j_1} = \pi_X(S^i_{j_2})$, where $X$ is the set of variables in $S^i_{j_1}$. Let $\phi$ be the conjunction of all such predicates on $U$.

Lemma E.12. Let $f(U^i_j)$ be a function which is $(-1)^{|U^i_j|}$ if $S^i_j$ has an even number of children in the hierarchy tree of $qh_i$ and 1 otherwise. Then, $\sum_{\{T \mid T \models U\}} (-1)^{|T|} = \prod_{i,j} f(U^i_j)$ if $U$ satisfies $\phi$ and 0 otherwise.

Using the above lemma, we get

$$\Pr[q] = \sum_{\{U \mid ip^n(U),\ \phi(U)\}} N(sig'(U)) \, f(U) \, F(U)$$
$$= \sum_{sig} N'(sig) \sum_{\{U \mid ip^n(U),\ \phi(U),\ sig(U) = sig\}} f(U) \, F(U)$$

Next, we add the remaining independence predicates, namely $ip^t$. Consider all pairs $S^i_{j_1}$ and $S^i_{j_2}$ in the query that refer to the same predicate and which have not been separated using $ip^n$. Fix an ordering on the subgoals of the query, and let $g(U_j)$ be a function which is $(-1)^{|U_j|}$ if there is an odd number of subgoals less than $S_j$ that need to be separated from $S_j$ and 1 otherwise. Then,

$$\Pr[q] = \sum_{sig} N'(sig) \sum_{\{U \mid ip^n(U),\ ip^t(U),\ \phi(U),\ sig(U) = sig\}} g(U) \, f(U) \, F(U)$$

We observe that computing the inner sum is equivalent to evaluating the query $(ip^n(U) \wedge ip^t(U) \wedge \phi(U) \wedge sig(U) = sig)$ on a probabilistic database with schema $S^i_j$ and instance $U^i_j$ and probabilities given by $\Pr[t \in S^i_j] = g(t)f(t)F(t) / (1 + g(t)f(t)F(t))$.
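The conversion from weights to probabilities used here rests on a simple identity: if every tuple $t$ has a weight $w_t \neq -1$ and is given the "probability" $p_t = w_t/(1+w_t)$, then the sum over all worlds satisfying a predicate of the product of the weights of the tuples in the world equals $\prod_t (1+w_t)$ times the probability of the predicate. The Python sketch below checks this on a toy example by brute force. The weights, the predicate and all function names are illustrative assumptions, and since weights may be negative the resulting "probabilities" need not lie in $[0,1]$.

```python
from fractions import Fraction
from itertools import combinations


def subsets(items):
    """Yield all subsets of items as frozensets."""
    items = sorted(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def weighted_sum(weights, predicate):
    """Sum over worlds satisfying predicate of the product of tuple weights."""
    total = Fraction(0)
    for world in subsets(weights):
        if predicate(world):
            term = Fraction(1)
            for t in world:
                term *= weights[t]
            total += term
    return total


def probability(probs, predicate):
    """Probability of predicate under independent tuple probabilities."""
    total = Fraction(0)
    for world in subsets(probs):
        if predicate(world):
            term = Fraction(1)
            for t in probs:
                term *= probs[t] if t in world else 1 - probs[t]
            total += term
    return total


def weights_to_probabilities(weights):
    """Map each weight w to w / (1 + w); requires w != -1."""
    return {t: w / (1 + w) for t, w in weights.items()}


weights = {"a": Fraction(2), "b": Fraction(-1, 3), "c": Fraction(1, 2)}
pred = lambda world: "a" in world or len(world) >= 2
norm = Fraction(1)
for w in weights.values():
    norm *= 1 + w
print(weighted_sum(weights, pred) == norm * probability(weights_to_probabilities(weights), pred))
```

The normalizing factor $\prod_t (1+w_t)$ does not depend on the predicate, so it can be computed once and multiplied back after query evaluation.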

Finally, to evaluate $(ip^n(U) \wedge ip^t(U) \wedge \phi(U) \wedge sig(U) = sig)$, we negate $ip^n$ and use inclusion-exclusion to represent it as probabilities of a finite number of conjunctive queries (with negated subgoals due to $\phi$). Each such conjunctive query is inversion-free [[need to give more details here]], because the $ip^n \wedge ip^t$ part consists of a bunch of join predicates corresponding to hierarchical unifiers, and the $\phi$ part also contains the same join predicates (but with negated sub-goals). So the resulting query is inversion-free and can be evaluated in PTIME.

E.3 Hardness Proof 

The main result of this section is that if there is a hierarchical unifier that contains an inversion but does not have an eraser, 
then the query is #P-hard. This shows that the PTIME condition and the hardness condition complement each other. 

Theorem E.13. Let $q_i, q_j \in \mathcal{H}^*(q)$ and let $q_k$ be their hierarchical unifier such that

1. $q_k$ contains an inversion.

2. $q_k$ does not have any eraser.
Then, $q$ is #P-complete.

We prove this in the rest of this section. First, we need some definitions and results. 

Definition E.14. (Redundant Set of Covers) A set of covers $qc_1, \ldots, qc_k$ is strictly redundant if there exists a mapping
$h : qc \to qc_1, \ldots, qc_k$, where $qc$ is not among $qc_1, \ldots, qc_k$. A set of covers is redundant if it contains a strictly redundant
subset of covers.

Definition E.15. Let $qc_0, \ldots, qc_k$ be a non-redundant set of covers. Let $qcs = qc_0, \ldots, qc_k$ and define the cover-set query
to be the minimization of $qcs$:

$$qcs' = \mathrm{minimize}(qcs) = qc'_0, \ldots, qc'_k$$

where each $qc'_i$ is a subset of the subgoals of $qc_i$. Denote the inclusions and the projection homomorphisms:

$$in_i : qc_i \to qcs, \quad i = 0, 1, \ldots, k$$
$$in : qcs' \to qcs$$
$$pr : qcs \to qcs'$$

Note that $pr \circ in$ is the identity mapping on $qcs'$.
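
To make the minimization and the projection homomorphism concrete, here is a minimal brute-force sketch for Boolean conjunctive queries. It assumes a hypothetical encoding that is not taken from the paper: a query is a list of atoms `(relation, args)`, strings are variables, and non-string terms are constants. The search is exponential and only meant for tiny examples; `homomorphisms` and `minimize` are illustrative names.

```python
from itertools import product


def is_var(term):
    """Assumed encoding: strings are variables, everything else is a constant."""
    return isinstance(term, str)


def variables(query):
    """Sorted list of the variables occurring in the query."""
    return sorted({t for _, args in query for t in args if is_var(t)})


def terms(query):
    """All terms (variables and constants) occurring in the query."""
    return sorted({t for _, args in query for t in args}, key=repr)


def homomorphisms(src, dst):
    """Yield each map from src's variables to dst's terms that sends every atom of src to an atom of dst."""
    dst_atoms = set(dst)
    src_vars = variables(src)
    for image in product(terms(dst), repeat=len(src_vars)):
        h = dict(zip(src_vars, image))
        # Constants are not in h, so h.get(t, t) leaves them fixed.
        if all((rel, tuple(h.get(t, t) for t in args)) in dst_atoms
               for rel, args in src):
            yield h


def minimize(query):
    """Drop atoms one at a time while the current query still maps into the remainder."""
    core = list(dict.fromkeys(query))  # remove duplicate atoms, keep order
    changed = True
    while changed:
        changed = False
        for atom in core:
            rest = [a for a in core if a != atom]
            if rest and next(homomorphisms(core, rest), None) is not None:
                core = rest
                changed = True
                break
    return core


# Conjunction of two hypothetical queries with disjoint variables:
# R(x, y), S(y) and R(u, v).
qcs = [("R", ("x", "y")), ("S", ("y",)), ("R", ("u", "v"))]
qcs_min = minimize(qcs)

# A projection homomorphism from qcs onto its minimization.
pr = next(homomorphisms(qcs, qcs_min))
```

Here `qcs_min` keeps `R(x, y)` and `S(y)`, since `R(u, v)` can be folded onto `R(x, y)`. The inclusion of `qcs_min` into `qcs` is the identity on variables, and `pr` plays the role of the projection.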
Definition E.16. The mappings $h_i : q \to qcs'$ obtained by composing $h : q \to qc_i$ (the cover mapping), $in_i$ (the $i$th
inclusion) and $pr$ (the projection) are called canonical mappings.

Lemma E.17. If $F$ is a non-redundant set of covers, then every mapping from $q$ to the cover-set query of $F$ is canonical
up to isomorphism.

Definition E.18. (Extension) Let $q_h \in H(q)$ be any hierarchical unifier with $\mathrm{Factors}(q_h) = \{qf_1, \ldots, qf_k\}$. Let $qc_1, \ldots, qc_k$
be a multiset of covers such that $qc_i$ contains the factor $qf_i$. An extension of $q_h$ is a query $qce'$ obtained by minimizing
$qce = qc_1, \ldots, qc_k, q_h$. Define the inclusion homomorphism $in : qce' \to qce$, the $i$th inclusion homomorphism $in_i : qc_i \to qce$
and the projection homomorphism $pr : qce \to qce'$ in the natural way. Also, define canonical mappings for extensions as we
defined them for cover-sets above.

Now some hardness results. 

Lemma E.19. Let $C = \{qc_1, \ldots, qc_k\}$ be a non-redundant set of covers such that their cover-set $qcs$ has an inversion.
Then, $q$ is #P-hard.

Proof. Without loss of generality, we can assume that for any proper subset of C, the cover-set does not have an inversion 
(otherwise we replace C with the smaller set and repeat the argument). 
Let the inversion in qcs consist of 

$$g_0(x),\ h_0(x,y),\ g_1(u_1,v_1),\ h_1(u_1,v_1),\ \ldots,\ g_{n-1}(u_{n-1},v_{n-1}),\ h_{n-1}(u_{n-1},v_{n-1}),\ g_n(x',y),\ h_n(y')$$

where subgoals $h_i$ and $g_{i+1}$ refer to the same relation. For each $qc_i \in C$, define the type of $qc_i$ as the subset of $[n]$ consisting
of all $t$ such that the image of $qc_i$ under the $pr$ homomorphism contains the subgoals $g_t, h_t$.
Claim: for each $qc_i$, its type contains at least one $t$ which is not present in any other type.

This follows from the minimality of the set $C$, because if $qc_i$ does not contribute any unique $t$, then we can remove it from
$C$ and still get an inversion in the cover-set query.

[[Next use the inclusion-exclusion on the types, and argue that exactly one conjunct of types is #P-hard (namely, one that
contains all the types). Use this to give a reduction from RSSS..ST query]]

□ 

Lemma E.20. Let $q_h \in H(q)$ be a hierarchical unifier that has an extension $qce$ such that all the mappings $q \to qce$
are canonical. Then $q$ is #P-hard.

Proof. We use the extension $qce$ to find a non-redundant set of covers whose cover-set has an inversion.

Let $\{qc_1, \ldots, qc_k\}$ be the set of covers used in the extension $qce$. Let $h$ be a mapping that maps each variable that does
not participate in the inversion to a unique constant. Construct a new set of covers $F' = \{qc'_1, \ldots, qc'_k\}$ where $qc'_i = h(qc_i)$.
Note that the resulting queries are indeed covers. We will show that $F'$ is a non-redundant set of covers whose cover-set has
an inversion.

It is easy to see that the cover-set of $F'$ is precisely the query $h(qce)$. Since $h$ does not touch the variables that participate
in the inversion, $h(qce)$ also contains an inversion, and so does $h(qce)$. To prove that $F'$ is non-redundant, we note that a
non-canonical mapping from some cover $qc$ to the cover-set of $F'$ gives a non-canonical mapping from a different cover $qc'$
(obtained by replacing the new constants back by variables) to the extension $qce$. This is a contradiction, hence all mappings
into the cover-set of $F'$ are canonical. So $F'$ is non-redundant. □

Lemma E.21. Let $q_h \in H(q)$ be a hierarchical unifier that does not have an eraser. Then, there is an extension $qce$ such
that all the mappings $q \to qce$ are canonical.